A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more.

Stars: ✭ 60 (+275%)

Mutual labels: jaccard-similarity

minhash-lsh

Minhash LSH in Golang

Stars: ✭ 20 (+25%)

Mutual labels: lsh

Text-Analysis

Explaining textual analysis tools in Python. Including Preprocessing, Skip Gram (word2vec), and Topic Modelling.

Stars: ✭ 48 (+200%)

Mutual labels: textual-analysis

lshensemble

LSH index for approximate set containment search

Stars: ✭ 48 (+200%)

Mutual labels: lsh

lsh

Locality Sensitive Hashing for Go (Multi-probe LSH, LSH Forest, basic LSH)

Stars: ✭ 92 (+475%)

Mutual labels: lsh

bagminhash

BagMinHash - Minwise Hashing Algorithm for Weighted Sets

Stars: ✭ 24 (+50%)

Mutual labels: jaccard-similarity

product-quantization

🙃Implementation of vector quantization algorithms, codes for Norm-Explicit Quantization: Improving Vector Quantization for Maximum Inner Product Search.

Stars: ✭ 40 (+150%)

Mutual labels: lsh


Locality Sensitive Hashing for semantic similarity

vs 3.x

LSH (Locality Sensitive Hashing) is primarily used to find the near-duplicates in a large set of documents. It can work with Hamming distance, the Jaccard coefficient, edit distance, or another notion of distance.

You can read the following tutorials if you want to understand more about it:

Although LSH is better suited to finding duplicate documents than semantically similar ones, this approach attempts to use LSH to compute semantic similarity among texts. To do so, the algorithm uses TF-IDF to extract each text's main tokens (or you can pre-compute them and pass them as a parameter). MinHash, which estimates Jaccard similarity, serves as the similarity function.
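As a rough illustration of the token-extraction step, here is a minimal sketch that ranks each text's tokens by TF-IDF using only the standard library. The names `tokenize` and `main_tokens`, the regex tokenizer, and the `top_k` parameter are assumptions made for this example; they are not taken from the project's code, which may rely on a different TF-IDF implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def main_tokens(docs, top_k=3):
    """Return the top_k tokens of each document ranked by TF-IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each token appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    result = []
    for toks in tokenized:
        tf = Counter(toks)
        # Plain TF-IDF: relative term frequency times log(N / df).
        scores = {
            t: (count / len(toks)) * math.log(n_docs / df[t])
            for t, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the mat",
]
print(main_tokens(docs, top_k=2))
```

Tokens that occur in every document (such as "the" here) get a score of zero and fall to the bottom of the ranking, so they are unlikely to be chosen as main tokens.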

The overall aim is to reduce the number of comparisons needed to find similar items. LSH uses hash collisions to capture similarity between objects: similar documents have a high probability of receiving the same hash value. For a single MinHash function, the probability that two sets collide is exactly their Jaccard similarity.
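The sketch below illustrates both ideas: a MinHash signature whose fraction of matching positions estimates the Jaccard similarity, and a simple banding step that only proposes pairs sharing at least one identical band. It is a minimal, assumption-laden example, not the project's implementation: all function names, the linear hash family modulo a Mersenne prime (only approximately min-wise independent), and the `num_perm` and `bands` parameters are hypothetical.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations

_PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus


def _token_hash(token):
    # Stable 64-bit hash (the built-in hash() is salted per process).
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def make_hash_params(num_perm=128, seed=1):
    # One (a, b) pair per simulated permutation: h(x) = (a*x + b) mod p.
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(_PRIME))
            for _ in range(num_perm)]


def minhash_signature(tokens, params):
    hashes = [_token_hash(t) for t in set(tokens)]
    if not hashes:
        raise ValueError("cannot build a signature for an empty token set")
    # For each hash function keep the minimum value over the set.
    return [min((a * h + b) % _PRIME for h in hashes) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    # Fraction of positions where the two signatures collide.
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def candidate_pairs(signatures, bands):
    """Return pairs of keys whose signatures agree on at least one band."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for key, sig in signatures.items():
        for i in range(bands):
            band = tuple(sig[i * rows:(i + 1) * rows])
            buckets[(i, band)].add(key)
    pairs = set()
    for keys in buckets.values():
        pairs.update(combinations(sorted(keys), 2))
    return pairs


params = make_hash_params(num_perm=256, seed=7)
a = {f"w{i}" for i in range(100)}
b = {f"w{i}" for i in range(50, 150)}  # true Jaccard with a is 50/150
sig_a = minhash_signature(a, params)
sig_b = minhash_signature(b, params)
print("estimated Jaccard:", estimate_jaccard(sig_a, sig_b))
print("candidates:", candidate_pairs({"a": sig_a, "b": sig_b}, bands=64))
```

With b bands of r rows each, two sets with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b, so the choice of bands and rows sets the similarity threshold above which pairs are likely to be compared.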

See this tutorial to learn how to use this LSH!

Run the following to install the dependencies:

  python3 -m pip install -r requirements.txt


italo-batista / lsh-semantic-similarity

