All Projects → pekoto → FastFuzzyStringMatcherDotNet

pekoto / FastFuzzyStringMatcherDotNet

Licence: MIT license
A BK tree implementation for fast fuzzy string matching

Programming Languages

C#
18002 projects

Projects that are alternatives of or similar to FastFuzzyStringMatcherDotNet

levenshtein.c
Levenshtein algorithm in C
Stars: ✭ 77 (+234.78%)
Mutual labels:  fuzzy-search, edit-distance, levenshtein-distance, string-matching
Symspell
SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
Stars: ✭ 1,976 (+8491.3%)
Mutual labels:  fuzzy-search, edit-distance, levenshtein-distance
LinSpell
Fast approximate strings search & spelling correction
Stars: ✭ 52 (+126.09%)
Mutual labels:  fuzzy-search, edit-distance, levenshtein-distance
Levenshtein
The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
Stars: ✭ 38 (+65.22%)
Mutual labels:  levenshtein-distance, string-matching
Quickenshtein
Making the quickest and most memory efficient implementation of Levenshtein Distance with SIMD and Threading support
Stars: ✭ 204 (+786.96%)
Mutual labels:  edit-distance, levenshtein-distance
itunes-cli
Command line interface for control iTunes
Stars: ✭ 16 (-30.43%)
Mutual labels:  fuzzy-search
wildmatch
Simple string matching with questionmark- and star-wildcard operator
Stars: ✭ 37 (+60.87%)
Mutual labels:  string-matching
fuzzy-match
Library and command line utility to do approximate string matching of a source against a bitext index and get matched source and target.
Stars: ✭ 31 (+34.78%)
Mutual labels:  string-matching
string-similarity-js
Lightweight string similarity function for javascript
Stars: ✭ 29 (+26.09%)
Mutual labels:  fuzzy-search
fish-fzy
fzy inegration with fish. Search history, navigate directories and more. Blazingly fast.
Stars: ✭ 18 (-21.74%)
Mutual labels:  fuzzy-search
spellchecker-wasm
SpellcheckerWasm is an extrememly fast spellchecker for WebAssembly based on SymSpell
Stars: ✭ 46 (+100%)
Mutual labels:  levenshtein-distance
edits.cr
Edit distance algorithms inc. Jaro, Damerau-Levenshtein, and Optimal Alignment
Stars: ✭ 16 (-30.43%)
Mutual labels:  edit-distance
bolt.nvim
⚡ Ultrafast multi-pane file manager for Neovim with fuzzy matching
Stars: ✭ 100 (+334.78%)
Mutual labels:  fuzzy-search
algos
A collection of algorithms in rust
Stars: ✭ 16 (-30.43%)
Mutual labels:  string-matching
rosette-elasticsearch-plugin
Document Enrichment plugin for Elasticsearch
Stars: ✭ 25 (+8.7%)
Mutual labels:  fuzzy-search
strutil
Golang metrics for calculating string similarity and other string utility functions
Stars: ✭ 114 (+395.65%)
Mutual labels:  string-matching
TeamReference
Team reference for Competitive Programming. Algorithms implementations very used in the ACM-ICPC contests. Latex template to build your own team reference.
Stars: ✭ 29 (+26.09%)
Mutual labels:  string-matching
phonetic-algorithms
Phonetic-Algorithms for fuzzy searching | PHP
Stars: ✭ 14 (-39.13%)
Mutual labels:  fuzzy-search
dice-coefficient
Sørensen–Dice coefficient
Stars: ✭ 37 (+60.87%)
Mutual labels:  edit-distance
astarix
AStarix: Fast and Optimal Sequence-to-Graph Aligner
Stars: ✭ 60 (+160.87%)
Mutual labels:  edit-distance

FastFuzzyStringMatcher

FastFuzzyStringMatcher is a BK tree implementation for quick in-memory string matching. (Also available in Java).

Features

  • Fast, fuzzy, string matching.
  • Search based on percentage and edit distance.
  • Associate data with string keywords and return both. For example, search for a file name, and return associated file paths.

Motivation

Although hash maps can be used for exact string matching, and tries can be used for prefix matching, there are few solutions out there for fast matching based on edit distance or percentage difference. Of course, you can search through every string in a collection, comparing its edit distance to the keyword you're searching for, but this tends to be pretty inefficient.

FastFuzzyStringMatcher builds a BK tree to make searching a lot more efficient.

Setup

The project was built using Visual Studio 2017 and should build cleanly, assuming you have the latest .NET package installed.

There are three projects in the solution:

  • Example app (an example application showing how the StringMatcher can be used to search a translation memory dictionary)
  • FastFuzzyStringMatcher (the main code. StringMatcher.cs is the main class you want to look at if you're interested in the code)
  • FastFuzzyStringMatcherTests (unit tests)

NuGet

You can install the package using NuGet too: Install-Package FastFuzzyStringMatcher

(Warning: This currently also pulls in some .NET Standard Library .dll references since the project was built as a .NET Standard class library.)

Usage

Usage is fairly simple:

  1. Declare a new instance: StringMatcher<T> myStringMatcher = new StringMatcher<T>();
  2. Add your data by calling myStringMatcher.Add(...)
  3. Search for your data by calling myStringMatcher.Search(...)

EditDistanceCalculator.cs is also public, so it can also be used independently to calculate the edit distance between two String objects.

Running the tests

FastFuzzyStringMatcherTests contains unit tests which demonstrate and verify the functionality of the StringMatcher and EditDistanceCalculator classes.

These can be run in Visual Studio by right-clicking and selecting Run Tests, or from the Test Explorer.

Running the example

The ExampleApp project shows how the StringMatcher can be used to implement a translation memory dictionary with fuzzy matching. EnglishJapaneseDictionarySearcher.cs contains the implementation of the translation memory dictionary. MainForm.cs is a just a simple form that accepts a search term and minimum match percentage.

Note: This is a rough-and-ready example and does not include threading or error handling. In a real application, you want to do large file reads on a separate thread to keep the UI responsive.

The dictionary loads around 50,000 entries from the JMDict Project. StringMatcher should be able to handle a lot more than 50,000 translation pairs, but I wanted to keep the download size fairly small.

How Does It All Work?

1. Edit Distance

The fuzzy string matching relies on edit distance.

Edit distance, better known as Levenshtein distance, is the minimum number of edits it takes to turn one string into another, using substitution, insertion, and deletion.

For example, to turn cat into hate:

  1. cat > hat (substitute c for h)
  2. hat > hate (insert e)

Edit distance = 2

The algorithm to calculate this uses dynamic programming to build a matrix of edit distances. Wiki has a nice explanation and good examples.

EditDistanceCalculator.java uses the iterative with two matrix rows approach. This seems to give the best performance based on some quick tests I ran.

2. BK Tree

StringMatcher is essentially a BK tree implementation.

In a BK tree, every node is added based on its edit distance from the root.

For example, say we had this collection of words: hat, cat, kate, ball, and bat.

We start by adding hat. It becomes the root:

Next we add cat. This has an edit distance of 1 from hat (substitute h for c), so we add it as a child with a key of 1:

We do the same with kate and ball -- calculate their edit distances respective to the root, and then add them as children with those keys:

Finally we add bat. But notice that the edit distance is 1, and we already have a child with edit distance 1 -- cat. No problem. We just move down to cat, calculate the edit distance between cat and bat, and add the node as a child of cat.

Okay, now we're ready to search!

Imagine we accidentally typed in the word zat, and we want to get a list of potential corrections for our typo. Let's say we want to search all of our nodes with a maximum edit distance of 1.

First, we compare zat with our root, hat. Sure enough, the edit distance is 1, which is within our threshold, so we add hat to our list of results to return.

Next, we examine all of the child nodes within the current edit distance +/- our edit distance threshold of 1.

  • 1 (current edit distance) - 1 (our threshold) = 0 (min edit distance)
  • 1 (current edit distance) + 1 (our threshold) = 2 (max edit distance)

So we'll examine all of the children that have an edit distance between 0-2. That means we'll examine kate and cat, but ignore ball. First, let's check out kate.

Oh uh, the edit distance between zat and kate is 2, so we ignore this node, and there are no children, so let's back up.

cat has an edit distance of 1, so let's check it out. The edit distance between zat and cat is also 1, which is within our threshold, so hooray! We have another result.

Oh yeah, and cat has a child node. We repeat the step we did at the root but using our current node: work out the maximum and minimum threshold based on the edit distance between zat and cat, and then examine children within that threshold.

This brings us down to bat. We check the edit distance, and again find it's within our threshold.

With that we're done, and we return hat, cat, and bat. We can imagine any of these might be a typo for zat. If you wanted to predict which of the three words was most likely meant by the user, you could also take into account which keys are most commonly mistyped. For example, c is closest to z on a standard QWERTY keyboard, so you could guess that they probably meant cat.

Overall, we still ended up searching 80% of our tree, but 80% can still lead to a significant saving if you have, for example, 500,000 strings in your collection.

Exercise

What would happen if zap had been added to our BK tree?

Thoughts

The BK tree is a simple data structure that can deliver decent performance gains when you need to search a large number of strings. They're quick to implement and having fuzzy searching and custom spell checking can be a super nice feature for your application, especially when you're dealing with translation data or you have lots of custom strings that won't be picked up by a standard spell checker, like fund codes, stock ticker symbols, or fictional character names.

Happy searching :)

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].