All Projects → schollz → Closestmatch

schollz / Closestmatch

Licence: mit
Golang library for fuzzy matching within a set of strings 📃

Programming Languages

go
31211 projects - #10 most used programming language

Projects that are alternatives of or similar to Closestmatch

Fastenshtein
The fastest .Net Levenshtein around
Stars: ✭ 115 (-67.42%)
Mutual labels:  fuzzy-matching, levenshtein
Symspellpy
Python port of SymSpell
Stars: ✭ 420 (+18.98%)
Mutual labels:  fuzzy-matching, levenshtein
levenshtein.c
Levenshtein algorithm in C
Stars: ✭ 77 (-78.19%)
Mutual labels:  fuzzy-matching, levenshtein
Abydos
Abydos NLP/IR library for Python
Stars: ✭ 91 (-74.22%)
Mutual labels:  fuzzy-matching, levenshtein
Symspell
SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
Stars: ✭ 1,976 (+459.77%)
Mutual labels:  fuzzy-matching, levenshtein
Fuzzball.js
Easy to use and powerful fuzzy string matching, port of fuzzywuzzy.
Stars: ✭ 225 (-36.26%)
Mutual labels:  fuzzy-matching, levenshtein
stringdistance
A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more..
Stars: ✭ 60 (-83%)
Mutual labels:  fuzzy-matching, levenshtein
Levenshtein
The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
Stars: ✭ 38 (-89.24%)
Mutual labels:  levenshtein
RepostCheckerBot
Bot for checking reposts on reddit
Stars: ✭ 36 (-89.8%)
Mutual labels:  levenshtein
levenshtein-rs
Levenshtein algorithm in Rust
Stars: ✭ 37 (-89.52%)
Mutual labels:  levenshtein
java-sdk
一些常用的java sdk和工具类(日期工具类,分布式锁,redis缓存,二叉树,反射工具类,线程池,对称/非对称/分段加解密,json序列化,http工具,雪花算法,字符串相似度,集合操作工具,xml解析,重试Retry工具类,Jvm监控等)
Stars: ✭ 26 (-92.63%)
Mutual labels:  levenshtein
fish-fzy
fzy inegration with fish. Search history, navigate directories and more. Blazingly fast.
Stars: ✭ 18 (-94.9%)
Mutual labels:  fuzzy-matching
splink
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
Stars: ✭ 181 (-48.73%)
Mutual labels:  fuzzy-matching
strutil
Golang metrics for calculating string similarity and other string utility functions
Stars: ✭ 114 (-67.71%)
Mutual labels:  levenshtein
Go Edlib
Golang string comparison and edit distance algorithms library, featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...
Stars: ✭ 253 (-28.33%)
Mutual labels:  levenshtein
edits.cr
Edit distance algorithms inc. Jaro, Damerau-Levenshtein, and Optimal Alignment
Stars: ✭ 16 (-95.47%)
Mutual labels:  levenshtein
Re Flex
The regex-centric, fast lexical analyzer generator for C++ with full Unicode support. Faster than Flex. Accepts Flex specifications. Generates reusable source code that is easy to understand. Introduces indent/dedent anchors, lazy quantifiers, functions for lex/syntax error reporting, and more. Seamlessly integrates with Bison and other parsers.
Stars: ✭ 274 (-22.38%)
Mutual labels:  fuzzy-matching
similar-english-words
Give me a word and I’ll give you an array of words that differ by a single letter.
Stars: ✭ 25 (-92.92%)
Mutual labels:  levenshtein
hubot-suggest
Suggest hubot commands when not found
Stars: ✭ 29 (-91.78%)
Mutual labels:  levenshtein
fuzzy-matcher
Fuzzy Matching Library for Rust
Stars: ✭ 140 (-60.34%)
Mutual labels:  fuzzy-matching

closestmatch 📃

Version Build Status Code Coverage GoDoc

closestmatch is a simple and fast Go library for fuzzy matching an input string to a list of target strings. closestmatch is useful for handling input from a user where the input (which could be mispelled or out of order) needs to match a key in a database. closestmatch uses a bag-of-words approach to precompute character n-grams to represent each possible target string. The closest matches have highest overlap between the sets of n-grams. The precomputation scales well and is much faster and more accurate than Levenshtein for long strings.

Getting Started

Install

go get -u -v github.com/schollz/closestmatch

Use

Create a closestmatch object from a list words

// Take a slice of keys, say band names that are similar
// http://www.tonedeaf.com.au/412720/38-bands-annoyingly-similar-names.htm
wordsToTest := []string{"King Gizzard", "The Lizard Wizard", "Lizzard Wizzard"}

// Choose a set of bag sizes, more is more accurate but slower
bagSizes := []int{2}

// Create a closestmatch object
cm := closestmatch.New(wordsToTest, bagSizes)

Find the closest match, or find the N closest matches

fmt.Println(cm.Closest("kind gizard"))
// returns 'King Gizzard'

fmt.Println(cm.ClosestN("kind gizard",3))
// returns [King Gizzard Lizzard Wizzard The Lizard Wizard]

Calculate the accuracy

// Calculate accuracy
fmt.Println(cm.AccuracyMutatingWords())
// ~ 66 % (still way better than Levenshtein which hits 0% with this particular set)

// Improve accuracy by adding more bags
bagSizes = []int{2, 3, 4}
cm = closestmatch.New(wordsToTest, bagSizes)
fmt.Println(cm.AccuracyMutatingWords())
// accuracy improves to ~ 76 %

Save/Load

// Save your current calculated bags
cm.Save("closestmatches.gob")

// Open it again
cm2, _ := closestmatch.Load("closestmatches.gob")
fmt.Println(cm2.Closest("lizard wizard"))
// prints "The Lizard Wizard"

Advantages

closestmatch is more accurate than Levenshtein for long strings (like in the test corpus).

closestmatch is ~20x faster than a fast implementation of Levenshtein. Try it yourself with the benchmarks:

cd $GOPATH/src/github.com/schollz/closestmatch && go test -run=None -bench=. > closestmatch.bench
cd $GOPATH/src/github.com/schollz/closestmatch/levenshtein && go test -run=None -bench=. > levenshtein.bench
benchcmp levenshtein.bench ../closestmatch.bench

which gives the following benchmark (on Intel i7-3770 CPU @ 3.40GHz w/ 8 processors):

benchmark                 old ns/op     new ns/op     delta
BenchmarkNew-8            1.47          1933870       +131555682.31%
BenchmarkClosestOne-8     104603530     4855916       -95.36%

The New() function in closestmatch is so slower than levenshtein because there is precomputation needed.

Disadvantages

closestmatch does worse for matching lists of single words, like a dictionary. For comparison:

$ cd $GOPATH/src/github.com/schollz/closestmatch && go test
Accuracy with mutating words in book list:      90.0%
Accuracy with mutating letters in book list:    100.0%
Accuracy with mutating letters in dictionary:   38.9%

while levenshtein performs slightly better for a single-word dictionary (but worse for longer names, like book titles):

$ cd $GOPATH/src/github.com/schollz/closestmatch/levenshtein && go test
Accuracy with mutating words in book list:      40.0%
Accuracy with mutating letters in book list:    100.0%
Accuracy with mutating letters in dictionary:   64.8%

License

MIT

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].