adrg / strutil

Licence: MIT license
Golang metrics for calculating string similarity and other string utility functions

Projects that are alternatives of or similar to strutil

stringdistance
A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more.
Stars: ✭ 60 (-47.37%)
Mutual labels:  levenshtein, jaro-winkler, jaccard-similarity, jaccard, string-similarity, hamming-distance, jaro, dice-coefficient
eddie
No description or website provided.
Stars: ✭ 18 (-84.21%)
Mutual labels:  levenshtein, jaro-winkler, string-similarity, jaro
stringosim
String similarity functions and string distances: Jaccard, Levenshtein, Hamming, Jaro-Winkler, Q-grams, N-grams, LCS (Longest Common Subsequence), Cosine similarity...
Stars: ✭ 47 (-58.77%)
Mutual labels:  levenshtein, jaro-winkler, string-distance, jaccard
spark-stringmetric
Spark functions to run popular phonetic and string matching algorithms
Stars: ✭ 51 (-55.26%)
Mutual labels:  jaro-winkler, jaccard-similarity, hamming-distance
Levenshtein
The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
Stars: ✭ 38 (-66.67%)
Mutual labels:  levenshtein, string-matching, string-similarity
edits.cr
Edit distance algorithms inc. Jaro, Damerau-Levenshtein, and Optimal Alignment
Stars: ✭ 16 (-85.96%)
Mutual labels:  levenshtein, jaro-winkler, jaro
strsim
String similarity based on Dice's coefficient in Go
Stars: ✭ 39 (-65.79%)
Mutual labels:  string-matching, string-similarity, dice-coefficient
beda
Beda is a Go library for detecting how similar two strings are
Stars: ✭ 34 (-70.18%)
Mutual labels:  string-distance, string-matching, string-similarity
stance
Learned string similarity for entity names using optimal transport.
Stars: ✭ 27 (-76.32%)
Mutual labels:  string-distance, string-matching, string-similarity
simetric
String similarity metrics for Elixir
Stars: ✭ 59 (-48.25%)
Mutual labels:  levenshtein, jaro-winkler
Jellyfish
🎐 a python library for doing approximate and phonetic matching of strings.
Stars: ✭ 1,571 (+1278.07%)
Mutual labels:  levenshtein, jaro-winkler
seqalign
Collection of sequence alignment algorithms.
Stars: ✭ 20 (-82.46%)
Mutual labels:  smith-waterman, string-distance
string-similarity-js
Lightweight string similarity function for javascript
Stars: ✭ 29 (-74.56%)
Mutual labels:  string, string-similarity
levenshtein finder
Similar string search in Levenshtein distance
Stars: ✭ 19 (-83.33%)
Mutual labels:  levenshtein, string-distance
ceja
PySpark phonetic and string matching algorithms
Stars: ✭ 24 (-78.95%)
Mutual labels:  jaro-winkler, hamming-distance
Textdistance
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
Stars: ✭ 2,575 (+2158.77%)
Mutual labels:  levenshtein, hamming-distance
Java String Similarity
Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...
Stars: ✭ 2,403 (+2007.89%)
Mutual labels:  jaro-winkler, string-distance
fuzzywuzzy
Fuzzy string matching for PHP
Stars: ✭ 60 (-47.37%)
Mutual labels:  string-distance, string-matching
vbml
Way to check, match and resist.
Stars: ✭ 27 (-76.32%)
Mutual labels:  string, string-matching
levenshtein.c
Levenshtein algorithm in C
Stars: ✭ 77 (-32.46%)
Mutual labels:  levenshtein, string-matching

strutil

strutil provides a collection of string metrics for calculating string similarity as well as other string utility functions.
Full documentation can be found at https://pkg.go.dev/github.com/adrg/strutil.

Installation

go get github.com/adrg/strutil

String metrics

The package defines the StringMetric interface, which all string metrics implement. A metric is passed to the Similarity function, which calculates the similarity between the two specified strings using that metric.

type StringMetric interface {
    Compare(a, b string) float64
}

// Signature only; the function body is omitted here.
func Similarity(a, b string, metric StringMetric) float64

All defined string metrics can be found in the metrics package.

Hamming

Calculate similarity.

similarity := strutil.Similarity("text", "test", metrics.NewHamming())
fmt.Printf("%.2f\n", similarity) // Output: 0.75

Calculate distance.

ham := metrics.NewHamming()
fmt.Printf("%d\n", ham.Distance("one", "once")) // Output: 2
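
For intuition, the two outputs above can be reproduced with a few lines of plain Go. This is a minimal sketch, not the package's implementation. It assumes that each surplus character of the longer string counts as one mismatch and that similarity is one minus the distance divided by the longer length. The function names are hypothetical.

```go
package main

import "fmt"

// hammingDistance counts the positions at which a and b differ.
// Assumption: every surplus character of the longer string counts as
// one mismatch, which is consistent with the "one"/"once" example.
func hammingDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	if len(ra) > len(rb) {
		ra, rb = rb, ra // make ra the shorter string
	}
	distance := len(rb) - len(ra)
	for i := range ra {
		if ra[i] != rb[i] {
			distance++
		}
	}
	return distance
}

// hammingSimilarity normalizes the distance by the longer length
// (an assumed normalization).
func hammingSimilarity(a, b string) float64 {
	longest := len([]rune(a))
	if n := len([]rune(b)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 1
	}
	return 1 - float64(hammingDistance(a, b))/float64(longest)
}

func main() {
	fmt.Println(hammingDistance("one", "once"))             // Output: 2
	fmt.Printf("%.2f\n", hammingSimilarity("text", "test")) // Output: 0.75
}
```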

More information and additional examples can be found on pkg.go.dev.

Levenshtein

Calculate similarity using default options.

similarity := strutil.Similarity("graph", "giraffe", metrics.NewLevenshtein())
fmt.Printf("%.2f\n", similarity) // Output: 0.43

Configure edit operation costs.

lev := metrics.NewLevenshtein()
lev.CaseSensitive = false
lev.InsertCost = 1
lev.ReplaceCost = 2
lev.DeleteCost = 1

similarity := strutil.Similarity("make", "Cake", lev)
fmt.Printf("%.2f\n", similarity) // Output: 0.50

Calculate distance.

lev := metrics.NewLevenshtein()
fmt.Printf("%d\n", lev.Distance("graph", "giraffe")) // Output: 4
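
The effect of the edit operation costs can be illustrated with a standalone dynamic-programming sketch. It is not the package's code: editDistance and min3 are hypothetical names, and dividing the distance by the length of the longer string is an assumed normalization that happens to reproduce the similarity values shown above.

```go
package main

import "fmt"

// editDistance computes a weighted Levenshtein distance between a and b
// using two rolling rows of the dynamic-programming table.
func editDistance(a, b string, insertCost, deleteCost, replaceCost int) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j * insertCost // build b[:j] from the empty string
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i * deleteCost // delete all of a[:i]
		for j := 1; j <= len(rb); j++ {
			sub := prev[j-1]
			if ra[i-1] != rb[j-1] {
				sub += replaceCost
			}
			cur[j] = min3(sub, prev[j]+deleteCost, cur[j-1]+insertCost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three integers.
func min3(x, y, z int) int {
	if y < x {
		x = y
	}
	if z < x {
		x = z
	}
	return x
}

func main() {
	d := editDistance("graph", "giraffe", 1, 1, 1)
	fmt.Println(d) // Output: 4
	// Assumed normalization: divide by the longer length (7 runes).
	fmt.Printf("%.2f\n", 1-float64(d)/7) // Output: 0.43
	// Case-insensitive comparison is emulated by lowercasing the inputs.
	fmt.Println(editDistance("make", "cake", 1, 1, 2)) // Output: 2
}
```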

More information and additional examples can be found on pkg.go.dev.

Jaro

similarity := strutil.Similarity("think", "tank", metrics.NewJaro())
fmt.Printf("%.2f\n", similarity) // Output: 0.78

More information and additional examples can be found on pkg.go.dev.

Jaro-Winkler

similarity := strutil.Similarity("think", "tank", metrics.NewJaroWinkler())
fmt.Printf("%.2f\n", similarity) // Output: 0.80
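
The relationship between the Jaro and Jaro-Winkler values above can be sketched with the textbook definitions. This is an illustration, not the package's implementation: the function names are hypothetical, and the prefix scale of 0.1 and the maximum prefix length of 4 are the conventional Jaro-Winkler constants, assumed here.

```go
package main

import "fmt"

// jaro returns the textbook Jaro similarity of a and b.
func jaro(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	if len(ra) == 0 || len(rb) == 0 {
		if len(ra) == len(rb) {
			return 1
		}
		return 0
	}
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	window := longest/2 - 1 // how far apart two matching runes may be
	if window < 0 {
		window = 0
	}
	matchedA := make([]bool, len(ra))
	matchedB := make([]bool, len(rb))
	matches := 0
	for i := range ra {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(rb) {
			hi = len(rb)
		}
		for j := lo; j < hi; j++ {
			if !matchedB[j] && ra[i] == rb[j] {
				matchedA[i], matchedB[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Count matched runes that appear in a different order.
	outOfOrder, j := 0, 0
	for i := range ra {
		if !matchedA[i] {
			continue
		}
		for !matchedB[j] {
			j++
		}
		if ra[i] != rb[j] {
			outOfOrder++
		}
		j++
	}
	m := float64(matches)
	t := float64(outOfOrder) / 2 // number of transpositions
	return (m/float64(len(ra)) + m/float64(len(rb)) + (m-t)/m) / 3
}

// jaroWinkler boosts the Jaro similarity for a shared prefix of up to
// four runes, using the conventional scaling factor 0.1 (assumed).
func jaroWinkler(a, b string) float64 {
	sim := jaro(a, b)
	ra, rb := []rune(a), []rune(b)
	prefix := 0
	for prefix < 4 && prefix < len(ra) && prefix < len(rb) && ra[prefix] == rb[prefix] {
		prefix++
	}
	return sim + float64(prefix)*0.1*(1-sim)
}

func main() {
	fmt.Printf("%.4f\n", jaro("think", "tank"))        // Output: 0.7833
	fmt.Printf("%.4f\n", jaroWinkler("think", "tank")) // Output: 0.8050
}
```

For "think" and "tank" there are three matching characters (t, n, k) and no transpositions, so the Jaro value is (3/5 + 3/4 + 3/3) / 3. The shared prefix "t" then adds 0.1 times the remaining distance to 1.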

More information and additional examples can be found on pkg.go.dev.

Smith-Waterman-Gotoh

Calculate similarity using default options.

swg := metrics.NewSmithWatermanGotoh()
similarity := strutil.Similarity("times roman", "times new roman", swg)
fmt.Printf("%.2f\n", similarity) // Output: 0.82

Customize gap penalty and substitution function.

swg := metrics.NewSmithWatermanGotoh()
swg.CaseSensitive = false
swg.GapPenalty = -0.1
swg.Substitution = metrics.MatchMismatch {
    Match:    1,
    Mismatch: -0.5,
}

similarity := strutil.Similarity("Times Roman", "times new roman", swg)
fmt.Printf("%.2f\n", similarity) // Output: 0.96
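
To show how the gap penalty and the substitution scores interact, here is a simplified sketch. It is plain Smith-Waterman local alignment with a linear gap penalty, not the Gotoh variant and not the package's implementation. The function name is hypothetical, dividing the best score by the shorter length times the match score is an assumed normalization, and the values 1, -2 and -0.5 in the first call are assumed parameters chosen because they reproduce the default output above.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// localSimilarity returns the best local alignment score of a and b,
// divided by the best score the shorter string could achieve.
func localSimilarity(a, b string, match, mismatch, gap float64) float64 {
	ra, rb := []rune(a), []rune(b)
	shorter := len(ra)
	if len(rb) < shorter {
		shorter = len(rb)
	}
	if shorter == 0 {
		if len(ra) == len(rb) {
			return 1
		}
		return 0
	}
	best := 0.0
	prev := make([]float64, len(rb)+1)
	for i := 1; i <= len(ra); i++ {
		cur := make([]float64, len(rb)+1)
		for j := 1; j <= len(rb); j++ {
			sub := mismatch
			if ra[i-1] == rb[j-1] {
				sub = match
			}
			score := math.Max(0, prev[j-1]+sub)   // align the two runes
			score = math.Max(score, prev[j]+gap)  // skip a rune of a
			score = math.Max(score, cur[j-1]+gap) // skip a rune of b
			cur[j] = score
			best = math.Max(best, score)
		}
		prev = cur
	}
	return best / (float64(shorter) * match)
}

func main() {
	// Assumed parameters: match 1, mismatch -2, gap -0.5.
	fmt.Printf("%.2f\n", localSimilarity("times roman", "times new roman", 1, -2, -0.5)) // Output: 0.82
	// Case-insensitive comparison is emulated by lowercasing the input.
	a := strings.ToLower("Times Roman")
	fmt.Printf("%.2f\n", localSimilarity(a, "times new roman", 1, -0.5, -0.1)) // Output: 0.96
}
```

In both calls the best alignment matches all 11 characters of the shorter string and skips the 4 characters of "new " in the longer one, so the score is 11 plus four times the gap penalty.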

More information and additional examples can be found on pkg.go.dev.

Sorensen-Dice

Calculate similarity using default options.

sd := metrics.NewSorensenDice()
similarity := strutil.Similarity("time to make haste", "no time to waste", sd)
fmt.Printf("%.2f\n", similarity) // Output: 0.62

Customize n-gram size.

sd := metrics.NewSorensenDice()
sd.CaseSensitive = false
sd.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", sd)
fmt.Printf("%.2f\n", similarity) // Output: 0.53

More information and additional examples can be found on pkg.go.dev.

Jaccard

Calculate similarity using default options.

j := metrics.NewJaccard()
similarity := strutil.Similarity("time to make haste", "no time to waste", j)
fmt.Printf("%.2f\n", similarity) // Output: 0.45

Customize n-gram size.

j := metrics.NewJaccard()
j.CaseSensitive = false
j.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", j)
fmt.Printf("%.2f\n", similarity) // Output: 0.36

The Jaccard examples use the same input as the Sorensen-Dice examples because the two metrics are closely related. In fact, each coefficient can be calculated from the other.

Sorensen-Dice to Jaccard.

J = SD/(2-SD)

where SD is the Sorensen-Dice coefficient and J is the Jaccard index.

Jaccard to Sorensen-Dice.

SD = 2*J/(1+J)

where SD is the Sorensen-Dice coefficient and J is the Jaccard index.
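
The two conversions can be checked against the rounded outputs of the examples above. The helper names in this sketch are hypothetical.

```go
package main

import "fmt"

// diceToJaccard converts a Sorensen-Dice coefficient to a Jaccard index.
func diceToJaccard(sd float64) float64 {
	return sd / (2 - sd)
}

// jaccardToDice converts a Jaccard index to a Sorensen-Dice coefficient.
func jaccardToDice(j float64) float64 {
	return 2 * j / (1 + j)
}

func main() {
	// Inputs are the rounded values printed by the examples above.
	fmt.Printf("%.2f\n", diceToJaccard(0.62)) // Output: 0.45
	fmt.Printf("%.2f\n", diceToJaccard(0.53)) // Output: 0.36
	fmt.Printf("%.2f\n", jaccardToDice(0.45)) // Output: 0.62
}
```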

More information and additional examples can be found on pkg.go.dev.

Overlap Coefficient

Calculate similarity using default options.

oc := metrics.NewOverlapCoefficient()
similarity := strutil.Similarity("time to make haste", "no time to waste", oc)
fmt.Printf("%.2f\n", similarity) // Output: 0.67

Customize n-gram size.

oc := metrics.NewOverlapCoefficient()
oc.CaseSensitive = false
oc.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", oc)
fmt.Printf("%.2f\n", similarity) // Output: 0.57
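
Sorensen-Dice, Jaccard and the overlap coefficient can all be derived from the same n-gram counts. The sketch below is a standalone illustration with hypothetical function names, not the package's code. It assumes that repeated n-grams are counted with their multiplicity, which happens to reproduce the outputs shown in the three sections above. Strings shorter than the n-gram size simply yield 0 here, which is a simplification.

```go
package main

import (
	"fmt"
	"strings"
)

// ngramCounts returns how often each n-gram of size n occurs in s,
// together with the total number of n-grams.
func ngramCounts(s string, n int) (map[string]int, int) {
	runes := []rune(s)
	counts := make(map[string]int)
	total := 0
	for i := 0; i+n <= len(runes); i++ {
		counts[string(runes[i:i+n])]++
		total++
	}
	return counts, total
}

// coefficients returns the Sorensen-Dice, Jaccard and overlap
// coefficients of a and b, based on n-gram multisets (an assumption).
func coefficients(a, b string, n int) (dice, jaccard, overlap float64) {
	countsA, totalA := ngramCounts(a, n)
	countsB, totalB := ngramCounts(b, n)
	common := 0
	for gram, ca := range countsA {
		if cb := countsB[gram]; cb < ca {
			common += cb
		} else {
			common += ca
		}
	}
	if common == 0 {
		return 0, 0, 0 // also avoids dividing by zero
	}
	smaller := totalA
	if totalB < smaller {
		smaller = totalB
	}
	c := float64(common)
	dice = 2 * c / float64(totalA+totalB)
	jaccard = c / float64(totalA+totalB-common)
	overlap = c / float64(smaller)
	return dice, jaccard, overlap
}

func main() {
	d, j, o := coefficients("time to make haste", "no time to waste", 2)
	fmt.Printf("%.3f %.3f %.3f\n", d, j, o) // Output: 0.625 0.455 0.667

	// Case-insensitive comparison is emulated by lowercasing the input.
	a := strings.ToLower("Time to make haste")
	d, j, o = coefficients(a, "no time to waste", 3)
	fmt.Printf("%.3f %.3f %.3f\n", d, j, o) // Output: 0.533 0.364 0.571
}
```

With bigrams, the two strings contain 17 and 15 n-grams and share 10 of them, so Sorensen-Dice is 20/32, Jaccard is 10/22 and the overlap coefficient is 10/15.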

More information and additional examples can be found on pkg.go.dev.

Contributing

Contributions in the form of pull requests, issues, or just general feedback are always welcome.
See CONTRIBUTING.MD.

License

Copyright (c) 2019 Adrian-George Bostan.

This project is licensed under the MIT license. See LICENSE for more details.