dexyk / stringosim

Licence: MIT license
String similarity and string distance functions: Jaccard, Levenshtein, Hamming, Jaro-Winkler, Q-grams, N-grams, LCS (Longest Common Subsequence), Cosine similarity...

Projects that are alternatives of or similar to stringosim

strutil
Golang metrics for calculating string similarity and other string utility functions
Stars: ✭ 114 (+142.55%)
Mutual labels:  levenshtein, jaro-winkler, string-distance, jaccard
stringdistance
A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more..
Stars: ✭ 60 (+27.66%)
Mutual labels:  levenshtein, jaro-winkler, jaccard, jaro-distance
simetric
String similarity metrics for Elixir
Stars: ✭ 59 (+25.53%)
Mutual labels:  distance, levenshtein, jaro-winkler
Java String Similarity
Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...
Stars: ✭ 2,403 (+5012.77%)
Mutual labels:  distance, jaro-winkler, string-distance
similar-english-words
Give me a word and I’ll give you an array of words that differ by a single letter.
Stars: ✭ 25 (-46.81%)
Mutual labels:  distance, levenshtein
Multi-Face-Comparison
This repo is meant for backend API for face comparision and computer vision. It is built on python flask framework
Stars: ✭ 20 (-57.45%)
Mutual labels:  distance, comparison
Stringmetric
🎯 String metrics and phonetic algorithms for Scala (e.g. Dice/Sorensen, Hamming, Jaccard, Jaro, Jaro-Winkler, Levenshtein, Metaphone, N-Gram, NYSIIS, Overlap, Ratcliff/Obershelp, Refined NYSIIS, Refined Soundex, Soundex, Weighted Levenshtein).
Stars: ✭ 481 (+923.4%)
Mutual labels:  distance, levenshtein
Stopwords
Removes most frequent words (stop words) from a text content. Based on a Curated list of language statistics.
Stars: ✭ 83 (+76.6%)
Mutual labels:  distance, levenshtein
Textdistance
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
Stars: ✭ 2,575 (+5378.72%)
Mutual labels:  distance, levenshtein
edits.cr
Edit distance algorithms inc. Jaro, Damerau-Levenshtein, and Optimal Alignment
Stars: ✭ 16 (-65.96%)
Mutual labels:  levenshtein, jaro-winkler
levenshtein-edit-distance
Levenshtein edit distance
Stars: ✭ 59 (+25.53%)
Mutual labels:  distance, levenshtein
eddie
No description or website provided.
Stars: ✭ 18 (-61.7%)
Mutual labels:  levenshtein, jaro-winkler
Jellyfish
🎐 a python library for doing approximate and phonetic matching of strings.
Stars: ✭ 1,571 (+3242.55%)
Mutual labels:  levenshtein, jaro-winkler
spark-stringmetric
Spark functions to run popular phonetic and string matching algorithms
Stars: ✭ 51 (+8.51%)
Mutual labels:  jaro-winkler, cosine-distance
Deepdiff
Deep Difference and search of any Python object/data.
Stars: ✭ 985 (+1995.74%)
Mutual labels:  distance, comparison
levenshtein finder
Similar string search in Levenshtein distance
Stars: ✭ 19 (-59.57%)
Mutual labels:  levenshtein, string-distance
Quickenshtein
Making the quickest and most memory efficient implementation of Levenshtein Distance with SIMD and Threading support
Stars: ✭ 204 (+334.04%)
Mutual labels:  levenshtein, string-distance
bliss
Bliss music library that can compute distance between songs
Stars: ✭ 76 (+61.7%)
Mutual labels:  distance
hyperdiff
Find common, removed and added element between two collections.
Stars: ✭ 14 (-70.21%)
Mutual labels:  comparison
geodist
Golang package to compute the distance between two geographic latitude, longitude coordinates
Stars: ✭ 133 (+182.98%)
Mutual labels:  distance

stringosim

The plan for this package is to provide Go implementations of different string distance/similarity functions, such as Levenshtein (normalized, weighted, Damerau), Jaro-Winkler, Jaccard index, Euclidean distance, Hamming distance...

Currently implemented:

  • Levenshtein
  • Jaccard
  • Hamming
  • LCS
  • Q-gram
  • n-gram based Cosine distance

Work in progress...

Import and installation

To get the library just run:

    go get github.com/dexyk/stringosim

To use the library just import it in your code:

    import "github.com/dexyk/stringosim"

To run the tests, go to the directory where the stringosim package is installed and run:

    go test

Usage

Currently only the Levenshtein, Jaccard, Hamming, LCS, Q-gram and Cosine distances are implemented.

Levenshtein

Levenshtein distance can be calculated with default parameters (DefaultSimilarityOptions), where the cost of each insert, delete and substitute operation is 1. You can also supply your own costs through the SimilarityOptions type. Setting CaseInsensitive to true in SimilarityOptions makes the comparison ignore character case.

Example:

    fmt.Println(stringosim.Levenshtein([]rune("stringosim"), []rune("stingobim")))

    fmt.Println(stringosim.Levenshtein([]rune("stringosim"), []rune("stingobim"),
        stringosim.LevenshteinSimilarityOptions{
            InsertCost:     3,
            DeleteCost:     5,
            SubstituteCost: 2,
        }))

    fmt.Println(stringosim.Levenshtein([]rune("stringosim"), []rune("STRINGOSIM"),
        stringosim.LevenshteinSimilarityOptions{
            InsertCost:      3,
            DeleteCost:      4,
            SubstituteCost:  5,
            CaseInsensitive: true,
        }))
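
To make the effect of the three cost parameters concrete, here is a minimal standalone sketch of weighted Levenshtein distance. It is not stringosim's implementation; the `costs` type and the `weightedLevenshtein` function are hypothetical names used only for illustration, and it assumes the distance is the cheapest sequence of edits that turns the first string into the second.

```go
package main

import "fmt"

// costs is a hypothetical type for this sketch (not part of stringosim).
type costs struct {
	ins, del, sub int
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// weightedLevenshtein returns the minimum total cost of turning a into b.
func weightedLevenshtein(a, b []rune, c costs) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	// Turning an empty prefix of a into b[:j] takes j insertions.
	for j := range prev {
		prev[j] = j * c.ins
	}
	for i := 1; i <= len(a); i++ {
		// Turning a[:i] into the empty string takes i deletions.
		curr[0] = i * c.del
		for j := 1; j <= len(b); j++ {
			sub := prev[j-1]
			if a[i-1] != b[j-1] {
				sub += c.sub
			}
			curr[j] = minInt(sub, minInt(prev[j]+c.del, curr[j-1]+c.ins))
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func main() {
	a, b := []rune("stringosim"), []rune("stingobim")
	fmt.Println(weightedLevenshtein(a, b, costs{1, 1, 1})) // 2
	fmt.Println(weightedLevenshtein(a, b, costs{3, 5, 2})) // 7
}
```

With unit costs the two strings differ by one deletion ("r") and one substitution ("s" to "b"); with the custom costs the same edit script costs 5 + 2.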

Jaccard

Jaccard distance is calculated over n-grams, and the n-gram size can be set by the caller. If the size is omitted, the default value of 1 is used.

Example:

    fmt.Println(stringosim.Jaccard([]rune("stringosim"), []rune("stingobim")))

    fmt.Println(stringosim.Jaccard([]rune("stringosim"), []rune("stingobim"), []int{2}))

    fmt.Println(stringosim.Jaccard([]rune("stringosim"), []rune("stingobim"), []int{3}))
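
As an illustration of how the n-gram size changes the result, the following standalone sketch computes a Jaccard distance over sets of n-grams. It assumes the common definition 1 - |A ∩ B| / |A ∪ B|; whether stringosim uses sets or multisets is not stated here, and `ngramSet` and `jaccardDistance` are hypothetical helper names.

```go
package main

import "fmt"

// ngramSet collects the distinct n-grams of s (hypothetical helper).
func ngramSet(s []rune, n int) map[string]struct{} {
	set := make(map[string]struct{})
	for i := 0; i+n <= len(s); i++ {
		set[string(s[i:i+n])] = struct{}{}
	}
	return set
}

// jaccardDistance returns 1 - |A ∩ B| / |A ∪ B| over n-gram sets.
func jaccardDistance(a, b []rune, n int) float64 {
	setA, setB := ngramSet(a, n), ngramSet(b, n)
	if len(setA) == 0 && len(setB) == 0 {
		return 0 // two inputs without any n-grams are treated as identical
	}
	inter := 0
	for g := range setA {
		if _, ok := setB[g]; ok {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return 1 - float64(inter)/float64(union)
}

func main() {
	a, b := []rune("stringosim"), []rune("stingobim")
	fmt.Printf("%.4f\n", jaccardDistance(a, b, 1)) // 0.2222
	fmt.Printf("%.4f\n", jaccardDistance(a, b, 2)) // 0.5833
}
```

Larger n-grams are more sensitive to local edits, so the same pair of strings looks further apart with bigrams than with single characters.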

Hamming

Hamming distance can be calculated with options. By default the function calculates the standard Hamming distance with a case-sensitive comparison. Setting CaseInsensitive to true in the options makes the comparison ignore case.

If the strings to compare have different lengths, an error is returned.

Example:

    dis, _ := stringosim.Hamming([]rune("testing"), []rune("restink"))
    fmt.Println(dis)

    dis, _ = stringosim.Hamming([]rune("testing"), []rune("FESTING"), stringosim.HammingSimilarityOptions{
        CaseInsensitive: true,
    })
    fmt.Println(dis)

    _, err := stringosim.Hamming([]rune("testing"), []rune("testin"))
    fmt.Println(err)

Longest Common Subsequence (LCS)

LCS between two strings can be calculated with options. By default the comparison is case sensitive. Setting CaseInsensitive to true in the options makes the comparison ignore case.

Example:

    fmt.Println(stringosim.LCS([]rune("testing lcs algorithm"), []rune("another l c s example")))

    fmt.Println(stringosim.LCS([]rune("testing lcs algorithm"), []rune("ANOTHER L C S EXAMPLE"),
        stringosim.LCSSimilarityOptions{
            CaseInsensitive: true,
        }))
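
For readers unfamiliar with LCS, here is a minimal dynamic-programming sketch that returns the length of the longest common subsequence. It is an independent illustration: what stringosim's LCS function actually returns is not stated above, and `lcsLength` and `toLower` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"unicode"
)

// toLower returns a lower-cased copy of s, used for case-insensitive comparison.
func toLower(s []rune) []rune {
	out := make([]rune, len(s))
	for i, r := range s {
		out[i] = unicode.ToLower(r)
	}
	return out
}

// lcsLength returns the length of the longest common subsequence of a and b,
// keeping only two rows of the dynamic-programming table.
func lcsLength(a, b []rune) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			switch {
			case a[i-1] == b[j-1]:
				curr[j] = prev[j-1] + 1
			case prev[j] >= curr[j-1]:
				curr[j] = prev[j]
			default:
				curr[j] = curr[j-1]
			}
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func main() {
	fmt.Println(lcsLength([]rune("AGGTAB"), []rune("GXTXAYB")))                   // 4 ("GTAB")
	fmt.Println(lcsLength(toLower([]rune("aggtab")), toLower([]rune("GXTXAYB")))) // 4
}
```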

Jaro and Jaro-Winkler

Jaro and Jaro-Winkler can be calculated with options: case-insensitive comparison and, for Jaro-Winkler only, the threshold, p value and l value.

Example:

    fmt.Println(stringosim.Jaro([]rune("abaccbabaacbcb"), []rune("bababbcabbaaca")))
    fmt.Println(stringosim.Jaro([]rune("abaccbabaacbcb"), []rune("ABABAbbCABbaACA"),
        stringosim.JaroSimilarityOptions{
            CaseInsensitive: true,
        }))

    fmt.Println(stringosim.JaroWinkler([]rune("abaccbabaacbcb"), []rune("bababbcabbaaca")))
    fmt.Println(stringosim.JaroWinkler([]rune("abaccbabaacbcb"), []rune("BABAbbCABbaACA"),
        stringosim.JaroSimilarityOptions{
            CaseInsensitive: true,
            Threshold:       0.7,
            PValue:          0.1,
            LValue:          4,
        }))
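
The sketch below shows one common reading of the three Jaro-Winkler parameters: the prefix bonus is applied only when the Jaro similarity exceeds the threshold, p scales the bonus, and l caps the length of the shared prefix that is rewarded. This follows the textbook definition and is an assumption about how stringosim interprets Threshold, PValue and LValue; `jaro` and `jaroWinkler` are hypothetical standalone functions.

```go
package main

import "fmt"

// jaro returns the Jaro similarity of a and b in the range [0, 1].
func jaro(a, b []rune) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 1
	}
	if len(a) == 0 || len(b) == 0 {
		return 0
	}
	longest := len(a)
	if len(b) > longest {
		longest = len(b)
	}
	window := longest/2 - 1 // how far apart two matching runes may be
	if window < 0 {
		window = 0
	}
	matchedA := make([]bool, len(a))
	matchedB := make([]bool, len(b))
	matches := 0
	for i := range a {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(b) {
			hi = len(b)
		}
		for j := lo; j < hi; j++ {
			if !matchedB[j] && a[i] == b[j] {
				matchedA[i], matchedB[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Count matched runes that appear in a different order in a and b.
	k, outOfOrder := 0, 0
	for i := range a {
		if !matchedA[i] {
			continue
		}
		for !matchedB[k] {
			k++
		}
		if a[i] != b[k] {
			outOfOrder++
		}
		k++
	}
	m := float64(matches)
	t := float64(outOfOrder) / 2 // transpositions
	return (m/float64(len(a)) + m/float64(len(b)) + (m-t)/m) / 3
}

// jaroWinkler adds a bonus for a shared prefix of at most l runes,
// but only when the Jaro similarity is above threshold.
func jaroWinkler(a, b []rune, threshold, p float64, l int) float64 {
	sim := jaro(a, b)
	if sim <= threshold {
		return sim
	}
	prefix := 0
	for prefix < l && prefix < len(a) && prefix < len(b) && a[prefix] == b[prefix] {
		prefix++
	}
	return sim + float64(prefix)*p*(1-sim)
}

func main() {
	a, b := []rune("MARTHA"), []rune("MARHTA")
	fmt.Printf("%.4f\n", jaro(a, b))                   // 0.9444
	fmt.Printf("%.4f\n", jaroWinkler(a, b, 0.7, 0.1, 4)) // 0.9611
}
```

Implementations differ on details such as whether the transposition count is rounded down, so values from other libraries may not match this sketch exactly.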

Q-gram

Q-gram distance can be calculated using default options (DefaultQGramOptions): the q-gram length is 2 and the comparison is case sensitive. Passing QGramSimilarityOptions to the function lets you set a custom q-gram length and choose whether the comparison is case sensitive.

Example:

    fmt.Println(stringosim.QGram([]rune("abcde"), []rune("abdcde")))

    fmt.Println(stringosim.QGram([]rune("abcde"), []rune("ABDCDE"),
        stringosim.QGramSimilarityOptions{
            CaseInsensitive: true,
            NGramSizes:      []int{3},
        }))
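
To show what the q-gram length controls, here is a standalone sketch of the usual q-gram distance: the sum, over every q-gram, of the absolute difference between its occurrence counts in the two strings. This definition is an assumption about the library's behavior, and `qgramProfile` and `qgramDistance` are hypothetical helpers.

```go
package main

import "fmt"

// qgramProfile counts how often each q-gram occurs in s.
func qgramProfile(s []rune, q int) map[string]int {
	profile := make(map[string]int)
	for i := 0; i+q <= len(s); i++ {
		profile[string(s[i:i+q])]++
	}
	return profile
}

// qgramDistance sums |count in a - count in b| over all q-grams.
func qgramDistance(a, b []rune, q int) int {
	pa, pb := qgramProfile(a, q), qgramProfile(b, q)
	dist := 0
	for g, ca := range pa {
		diff := ca - pb[g] // pb[g] is 0 when g is absent from b
		if diff < 0 {
			diff = -diff
		}
		dist += diff
	}
	for g, cb := range pb {
		if _, seen := pa[g]; !seen {
			dist += cb // q-grams that occur only in b
		}
	}
	return dist
}

func main() {
	a, b := []rune("abcde"), []rune("abdcde")
	fmt.Println(qgramDistance(a, b, 2)) // 3: "bc", "bd" and "dc" are unshared
	fmt.Println(qgramDistance(a, b, 3)) // 5
}
```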

Cosine

Cosine distance can be calculated using default options (DefaultCosineOptions): the n-gram length is 2 and the comparison is case sensitive. Passing CosineSimilarityOptions to the function lets you set a custom n-gram length and choose whether the comparison is case sensitive.

Example:

    fmt.Println(stringosim.Cosine([]rune("abcde"), []rune("abdcde")))

    fmt.Println(stringosim.Cosine([]rune("abcde"), []rune("ABDCDE"),
        stringosim.CosineSimilarityOptions{
            CaseInsensitive: true,
            NGramSizes:      []int{3},
        }))
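
The following standalone sketch illustrates an n-gram based cosine measure: each string becomes a vector of n-gram counts, the similarity is the cosine of the angle between the two vectors, and the distance is taken as 1 minus that similarity. Whether stringosim returns the similarity or the distance is not stated above, so both are printed; `ngramCounts` and `cosineSimilarity` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"math"
)

// ngramCounts counts how often each n-gram occurs in s.
func ngramCounts(s []rune, n int) map[string]int {
	counts := make(map[string]int)
	for i := 0; i+n <= len(s); i++ {
		counts[string(s[i:i+n])]++
	}
	return counts
}

// cosineSimilarity returns dot(a, b) / (|a| * |b|) over n-gram count vectors.
// It returns 0 when either input has no n-grams.
func cosineSimilarity(a, b []rune, n int) float64 {
	ca, cb := ngramCounts(a, n), ngramCounts(b, n)
	dot, normA, normB := 0.0, 0.0, 0.0
	for g, x := range ca {
		dot += float64(x * cb[g])
		normA += float64(x * x)
	}
	for _, y := range cb {
		normB += float64(y * y)
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	sim := cosineSimilarity([]rune("abcde"), []rune("abdcde"), 2)
	fmt.Printf("similarity: %.4f\n", sim) // similarity: 0.6708
	fmt.Printf("distance: %.4f\n", 1-sim) // distance: 0.3292
}
```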