vickumar1981 / stringdistance

Licence: other
A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more.

Programming Languages

scala
5932 projects
java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to stringdistance

strutil
Golang metrics for calculating string similarity and other string utility functions
Stars: ✭ 114 (+90%)
Mutual labels:  levenshtein, jaro-winkler, jaccard-similarity, jaccard, string-similarity, hamming-distance, jaro, dice-coefficient
stringosim
String similarity functions, String distances, Jaccard, Levenshtein, Hamming, Jaro-Winkler, Q-grams, N-grams, LCS - Longest Common Subsequence, Cosine similarity...
Stars: ✭ 47 (-21.67%)
Mutual labels:  levenshtein, jaro-winkler, jaccard, jaro-distance
eddie
No description or website provided.
Stars: ✭ 18 (-70%)
Mutual labels:  levenshtein, jaro-winkler, string-similarity, jaro
edits.cr
Edit distance algorithms inc. Jaro, Damerau-Levenshtein, and Optimal Alignment
Stars: ✭ 16 (-73.33%)
Mutual labels:  levenshtein, jaro-winkler, jaro
Java String Similarity
Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...
Stars: ✭ 2,403 (+3905%)
Mutual labels:  jaro-winkler, levenshtein-distance, cosine-similarity
Levenshtein
The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
Stars: ✭ 38 (-36.67%)
Mutual labels:  levenshtein, levenshtein-distance, string-similarity
Jellyfish
🎐 a python library for doing approximate and phonetic matching of strings.
Stars: ✭ 1,571 (+2518.33%)
Mutual labels:  levenshtein, jaro-winkler, soundex
Symspell
SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
Stars: ✭ 1,976 (+3193.33%)
Mutual labels:  fuzzy-matching, levenshtein, levenshtein-distance
Textdistance
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
Stars: ✭ 2,575 (+4191.67%)
Mutual labels:  levenshtein, levenshtein-distance, hamming-distance
set-sketch-paper
SetSketch: Filling the Gap between MinHash and HyperLogLog
Stars: ✭ 23 (-61.67%)
Mutual labels:  cosine-similarity, jaccard-similarity, jaccard
spark-stringmetric
Spark functions to run popular phonetic and string matching algorithms
Stars: ✭ 51 (-15%)
Mutual labels:  jaro-winkler, jaccard-similarity, hamming-distance
levenshtein.c
Levenshtein algorithm in C
Stars: ✭ 77 (+28.33%)
Mutual labels:  fuzzy-matching, levenshtein, levenshtein-distance
LinSpell
Fast approximate strings search & spelling correction
Stars: ✭ 52 (-13.33%)
Mutual labels:  levenshtein, levenshtein-distance
spellchecker-wasm
SpellcheckerWasm is an extremely fast spellchecker for WebAssembly based on SymSpell
Stars: ✭ 46 (-23.33%)
Mutual labels:  levenshtein, levenshtein-distance
simetric
String similarity metrics for Elixir
Stars: ✭ 59 (-1.67%)
Mutual labels:  levenshtein, jaro-winkler
ceja
PySpark phonetic and string matching algorithms
Stars: ✭ 24 (-60%)
Mutual labels:  jaro-winkler, hamming-distance
Closestmatch
Golang library for fuzzy matching within a set of strings 📃
Stars: ✭ 353 (+488.33%)
Mutual labels:  fuzzy-matching, levenshtein
Symspellpy
Python port of SymSpell
Stars: ✭ 420 (+600%)
Mutual labels:  fuzzy-matching, levenshtein
Fastenshtein
The fastest .Net Levenshtein around
Stars: ✭ 115 (+91.67%)
Mutual labels:  fuzzy-matching, levenshtein
Quickenshtein
Making the quickest and most memory efficient implementation of Levenshtein Distance with SIMD and Threading support
Stars: ✭ 204 (+240%)
Mutual labels:  levenshtein, levenshtein-distance

StringDistance

A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more.

Works with generalized arrays.

For more detailed information, please refer to the API Documentation.

Requires: Java 8+ or Scala 2.11+


Contents

  1. Add it to your project
  2. Using in Scala
  3. Using in Scala with implicits
  4. Using in Java
  5. Using with Arrays
  6. Adding your own algorithm
  7. Reporting an Issue
  8. Contributing
  9. License

1. Add it to your project ...

Using sbt:

In build.sbt:

libraryDependencies += "com.github.vickumar1981" %% "stringdistance" % "1.2.7"

Using gradle:

In build.gradle:

dependencies {
    implementation 'com.github.vickumar1981:stringdistance_2.13:1.2.7'
}

Using Maven:

In pom.xml:

<dependency>
    <groupId>com.github.vickumar1981</groupId>
    <artifactId>stringdistance_2.13</artifactId>
    <version>1.2.7</version>
</dependency>

Notes:

  • For Scala 2.12, please use the stringdistance_2.12 artifact as a dependency instead.
  • For Scala 2.11, please use the stringdistance_2.11 artifact as a dependency instead.

2. Scala Usage

Example.scala:

// Scala example
import com.github.vickumar1981.stringdistance.StringDistance._
import com.github.vickumar1981.stringdistance.StringSound._
import com.github.vickumar1981.stringdistance.impl.{ConstantGap, LinearGap}

// Cosine Similarity
val cosSimilarity: Double = Cosine.score("hello", "chello")  // 0.935

// Damerau-Levenshtein Distance
val damerauDist: Int = Damerau.distance("martha", "marhta")  // 1
val damerau: Double = Damerau.score("martha", "marhta")  // 0.833

// Dice Coefficient
val diceCoefficient: Double = DiceCoefficient.score("martha", "marhta")  // 0.4

// Hamming Distance
val hammingDist: Int = Hamming.distance("martha", "marhta")  // 2
val hamming: Double = Hamming.score("martha", "marhta")  // 0.667

// Jaccard Similarity
val jaccard: Double = Jaccard.score("karolin", "kathrin", 1)

// Jaro and Jaro Winkler
val jaro: Double = Jaro.score("martha", "marhta")  // 0.944
val jaroWinkler: Double = JaroWinkler.score("martha", "marhta", 0.1)  // 0.961

// Levenshtein Distance
val levenshteinDist: Int = Levenshtein.distance("martha", "marhta")  // 2
val levenshtein: Double = Levenshtein.score("martha", "marhta")  // 0.667

// Longest Common Subsequence
val longestCommonSubSeq: Int = LongestCommonSeq.distance("martha", "marhta")  // 5

// Needleman Wunsch
val needlemanWunsch: Double = NeedlemanWunsch.score("martha", "marhta", ConstantGap())  // 0.667

// N-Gram Similarity and Distance
val ngramDist: Int = NGram.distance("karolin", "kathrin", 1)  // 5
val bigramDist: Int = NGram.distance("karolin", "kathrin", 2)  // 2
val ngramSimilarity: Double = NGram.score("karolin", "kathrin", 1)  // 0.714
val bigramSimilarity: Double = NGram.score("karolin", "kathrin", 2)  // 0.333

// N-Gram tokens, returns a List[String]
val tokens: List[String] = NGram.tokens("martha", 2)  // List("ma", "ar", "rt", "th", "ha")

// Overlap Similarity
val overlap: Double = Overlap.score("karolin", "kathrin", 1)  // 0.286
val overlapBiGram: Double = Overlap.score("karolin", "kathrin", 2)  // 0.667

// Smith Waterman Similarities
val smithWaterman: Double = SmithWaterman.score("martha", "marhta", (LinearGap(gapValue = -1), Integer.MAX_VALUE))
val smithWatermanGotoh: Double = SmithWatermanGotoh.score("martha", "marhta", ConstantGap())

// Tversky Similarity
val tversky: Double = Tversky.score("karolin", "kathrin", 0.5)  // 0.333

// Phonetic Similarity
val metaphone: Boolean = Metaphone.score("merci", "mercy")  // true
val soundex: Boolean = Soundex.score("merci", "mercy")  // true
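
The commented values above suggest how the scores relate to the distances: for the two six-character strings "martha" and "marhta", a Levenshtein distance of 2 is consistent with a score of 1 - 2/6 ≈ 0.667. The sketch below is a standalone Java illustration of that relationship, not the library's implementation. The class LevenshteinSketch, its method names, and the normalization by the longer string's length are assumptions made for this example.

```java
// Standalone sketch; does not use or reproduce the stringdistance library.
final class LevenshteinSketch {

    // Classic dynamic-programming edit distance, keeping only two rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1 - distance / length of the longer string.
    static double score(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(distance("martha", "marhta")); // prints 2
        double rounded = Math.round(score("martha", "marhta") * 1000) / 1000.0;
        System.out.println(rounded); // prints 0.667
    }
}
```

The same normalization also matches the commented Hamming and Damerau values above (2/6 and 1/6 subtracted from 1), although the library may handle edge cases differently.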

3. Scala: Use with Implicits

  • To use implicits and extend the String class: import com.github.vickumar1981.stringdistance.StringConverter._

Example.scala

// Scala example using implicits
import com.github.vickumar1981.stringdistance.StringConverter._

// Scores between two strings
val cosSimilarity: Double = "hello".cosine("chello")
val damerau: Double = "martha".damerau("marhta")
val diceCoefficient: Double = "martha".diceCoefficient("marhta")
val hamming: Double = "martha".hamming("marhta")
val jaccard: Double = "karolin".jaccard("kathrin")
val jaro: Double = "martha".jaro("marhta")
val jaroWinkler: Double = "martha".jaroWinkler("marhta")
val levenshtein: Double = "martha".levenshtein("marhta")
val needlemanWunsch: Double = "martha".needlemanWunsch("marhta")
val ngramSimilarity: Double = "karolin".nGram("kathrin")
val bigramSimilarity: Double = "karolin".nGram("kathrin", 2)
val overlap: Double = "karolin".overlap("kathrin")
val overlapBiGram: Double = "karolin".overlap("kathrin", 2)
val smithWaterman: Double = "martha".smithWaterman("marhta")
val smithWatermanGotoh: Double = "martha".smithWatermanGotoh("marhta")
val tversky: Double = "karolin".tversky("kathrin", 0.5)

// Distances between two strings
val damerauDist: Int = "martha".damerauDist("marhta")  // 1
val hammingDist: Int = "martha".hammingDist("marhta")
val levenshteinDist: Int = "martha".levenshteinDist("marhta")
val longestCommonSeq: Int = "martha".longestCommonSeq("marhta")
val ngramDist: Int = "karolin".nGramDist("kathrin")
val bigramDist: Int = "karolin".nGramDist("kathrin", 2)

// N-Gram tokens, returns a List[String]
val tokens: List[String] = "martha".tokens(2)  // List("ma", "ar", "rt", "th", "ha")

// Phonetic similarity of two strings
val metaphone: Boolean = "merci".metaphone("mercy")
val soundex: Boolean = "merci".soundex("mercy")
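
The tokens call above splits a string into overlapping n-grams. As a minimal standalone sketch of that idea (written in Java, not taken from the library; NGramSketch and its tokens method are hypothetical names), a sliding window of width n over the string yields the same list shown in the comment:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch; does not use or reproduce the stringdistance library.
final class NGramSketch {

    // Returns every contiguous substring of length n, from left to right.
    // A string shorter than n produces an empty list.
    static List<String> tokens(String s, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            out.add(s.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("martha", 2)); // prints [ma, ar, rt, th, ha]
    }
}
```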

4. Java Usage

  • To use in Java: import com.github.vickumar1981.stringdistance.util.StringDistance

Example.java

// Java example
import java.util.List;

import com.github.vickumar1981.stringdistance.util.StringDistance;
import com.github.vickumar1981.stringdistance.util.StringSound;

// Scores between two strings
Double cosSimilarity = StringDistance.cosine("hello", "chello");
Double damerau = StringDistance.damerau("martha", "marhta");
Double diceCoefficient = StringDistance.diceCoefficient("martha", "marhta");
Double hamming = StringDistance.hamming("martha", "marhta");
Double jaccard = StringDistance.jaccard("karolin", "kathrin");
Double jaro = StringDistance.jaro("martha", "marhta");
Double jaroWinkler = StringDistance.jaroWinkler("martha", "marhta");
Double levenshtein = StringDistance.levenshtein("martha", "marhta");
Double needlemanWunsch = StringDistance.needlemanWunsch("martha", "marhta");
Double ngramSimilarity = StringDistance.nGram("karolin", "kathrin");
Double bigramSimilarity = StringDistance.nGram("karolin", "kathrin", 2);
Double overlap = StringDistance.overlap("karolin", "kathrin");
Double overlapBiGram = StringDistance.overlap("karolin", "kathrin", 2);
Double smithWaterman = StringDistance.smithWaterman("martha", "marhta");
Double smithWatermanGotoh = StringDistance.smithWatermanGotoh("martha", "marhta");
Double tversky = StringDistance.tversky("karolin", "kathrin", 0.5);

// Distances between two strings
Integer damerauDist = StringDistance.damerauDist("martha", "marhta");
Integer hammingDist = StringDistance.hammingDist("martha", "marhta");
Integer levenshteinDist = StringDistance.levenshteinDist("martha", "marhta");
Integer longestCommonSeq = StringDistance.longestCommonSeq("martha", "marhta");
Integer ngramDist = StringDistance.nGramDist("karolin", "kathrin");
Integer bigramDist = StringDistance.nGramDist("karolin", "kathrin", 2);

// N-Gram tokens, returns a List<String>
List<String> tokens = StringDistance.nGramTokens("martha", 2);  // ["ma", "ar", "rt", "th", "ha"]

// Phonetic similarity of two strings
Boolean metaphone = StringSound.metaphone("merci", "mercy");
Boolean soundex = StringSound.soundex("merci", "mercy");
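
To make the jaro and jaroWinkler scores less opaque, here is a hedged standalone sketch of the textbook Jaro and Jaro-Winkler definitions. It does not call the library and may differ from it in details. JaroSketch and its methods are hypothetical names, and the prefix cap of 4 is the conventional choice rather than something the library documents here. For "martha" and "marhta" it reproduces the 0.944 and 0.961 values quoted in the Scala section.

```java
// Standalone sketch; does not use or reproduce the stringdistance library.
final class JaroSketch {

    static double jaro(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) {
            return 1.0;
        }
        // Characters match only if they are equal and at most `window` positions apart.
        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] matched1 = new boolean[s1.length()];
        boolean[] matched2 = new boolean[s2.length()];
        int matches = 0;
        for (int i = 0; i < s1.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(s2.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Walk the matched characters of both strings in order and count
        // the positions where they disagree; each pair is one transposition.
        int k = 0;
        int outOfOrder = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[k]) {
                k++;
            }
            if (s1.charAt(i) != s2.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double m = matches;
        double t = outOfOrder / 2.0;
        return (m / s1.length() + m / s2.length() + (m - t) / m) / 3.0;
    }

    // Boosts the Jaro score for a shared prefix of up to 4 characters.
    static double jaroWinkler(String s1, String s2, double scaling) {
        double j = jaro(s1, s2);
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return j + prefix * scaling * (1.0 - j);
    }

    public static void main(String[] args) {
        double j = jaro("martha", "marhta");
        double jw = jaroWinkler("martha", "marhta", 0.1);
        System.out.println(Math.round(j * 1000) / 1000.0);  // prints 0.944
        System.out.println(Math.round(jw * 1000) / 1000.0); // prints 0.961
    }
}
```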

5. Using with Arrays

  • You can use the ArrayDistance class just like the StringDistance class, except that it operates on generic arrays: Array[T] in Scala and T[] in Java.

  • Make sure your element classes can be compared using == in Scala or .equals in Java.

Scala Sample Code:

import com.github.vickumar1981.stringdistance.ArrayDistance._

// Example Levenshtein Distance and Score
val levenshteinDist = Levenshtein.distance(Array("m", "a", "r", "t", "h", "a"), Array("m", "a", "r", "h", "t", "a")) // 2
val levenshtein = Levenshtein.score(Array("m", "a", "r", "t", "h", "a"), Array("m", "a", "r", "h", "t", "a")) // 0.667

Java Example Code:
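
A minimal standalone sketch, assuming nothing about the library's Java API: it computes a Levenshtein distance over a generic T[] and compares elements with equals, which is why element classes need a meaningful .equals implementation. GenericLevenshteinSketch and its distance method are hypothetical names used only for this illustration.

```java
import java.util.Objects;

// Standalone sketch; does not use or reproduce the stringdistance library.
final class GenericLevenshteinSketch {

    // Edit distance between two arrays; elements are compared with equals().
    static <T> int distance(T[] a, T[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                // Objects.equals delegates to a[i - 1].equals(b[j - 1]) and tolerates nulls.
                int cost = Objects.equals(a[i - 1], b[j - 1]) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    public static void main(String[] args) {
        String[] first = {"m", "a", "r", "t", "h", "a"};
        String[] second = {"m", "a", "r", "h", "t", "a"};
        System.out.println(distance(first, second)); // prints 2
    }
}
```

The result matches the distance of 2 shown in the Scala sample for the same two arrays.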


6. Adding your own Distance or Scoring Algorithm

  1. Create a marker trait that extends StringMetricAlgorithm:
trait CustomAlgorithm extends StringMetricAlgorithm
  2. Create an implementation for that algorithm using an implicit object. Override the distance method if the object extends DistanceAlgorithm, or the score method if it extends ScoringAlgorithm.
implicit object CustomDistance extends DistanceAlgorithm[CustomAlgorithm] {
    override def distance(s1: String, s2: String): Int = {
        // Implement distance between s1 and s2
        ???
    }
}

implicit object CustomScore extends ScoringAlgorithm[CustomAlgorithm] {
    override def score(s1: String, s2: String): Double = {
        // Implement fuzzy score between s1 and s2
        ???
    }
}
  3. Create an object that extends StringMetric using your algorithm as the type parameter, and use the score and distance methods defined in the implicit objects.
object CustomMetric extends StringMetric[CustomAlgorithm]

val customScore: Double = CustomMetric.score("hello", "hello2")
val customDist: Int = CustomMetric.distance("hello", "hello2")

7. Reporting an Issue

Please report any issues or bugs to the GitHub issues page.


8. Contributing

Please view the contributing guidelines.


9. License

This project is licensed under the Apache 2 License.
