italo-batista / lsh-semantic-similarity

Licence: other
Locality Sensitive Hashing for semantic similarity (Python 3.x)

Programming Languages

python
139335 projects - #7 most used programming language
Jupyter Notebook
11667 projects

Projects that are alternatives of or similar to lsh-semantic-similarity

Datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble
Stars: ✭ 1,635 (+10118.75%)
Mutual labels:  lsh, jaccard-similarity
Dolphinn
High Dimensional Approximate Near(est) Neighbor
Stars: ✭ 32 (+100%)
Mutual labels:  lsh
Text-Similarity
A text similarity computation using minhashing and Jaccard distance on reuters dataset
Stars: ✭ 15 (-6.25%)
Mutual labels:  jaccard-similarity
lsh-rs
Locality Sensitive Hashing in Rust with Python bindings
Stars: ✭ 64 (+300%)
Mutual labels:  lsh
strutil
Golang metrics for calculating string similarity and other string utility functions
Stars: ✭ 114 (+612.5%)
Mutual labels:  jaccard-similarity
MoTIS
Mobile(iOS) Text-to-Image search powered by multimodal semantic representation models(e.g., OpenAI's CLIP). Accepted at NAACL 2022.
Stars: ✭ 60 (+275%)
Mutual labels:  lsh
Chinese financial sentiment dictionary
A Chinese financial sentiment word dictionary
Stars: ✭ 67 (+318.75%)
Mutual labels:  textual-analysis
wordhoard
This Python module can be used to obtain antonyms, synonyms, hypernyms, hyponyms, homophones and definitions.
Stars: ✭ 78 (+387.5%)
Mutual labels:  textual-analysis
H2 ALSH
Accurate and Fast ALSH for Maximum Inner Product Search (KDD 2018)
Stars: ✭ 18 (+12.5%)
Mutual labels:  lsh
Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Stars: ✭ 18 (+12.5%)
Mutual labels:  lsh
set-sketch-paper
SetSketch: Filling the Gap between MinHash and HyperLogLog
Stars: ✭ 23 (+43.75%)
Mutual labels:  jaccard-similarity
tika-similarity
Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features.
Stars: ✭ 92 (+475%)
Mutual labels:  jaccard-similarity
image-ndd-lsh
Near-duplicate image detection using Locality Sensitive Hashing
Stars: ✭ 42 (+162.5%)
Mutual labels:  lsh
stringdistance
A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more..
Stars: ✭ 60 (+275%)
Mutual labels:  jaccard-similarity
minhash-lsh
Minhash LSH in Golang
Stars: ✭ 20 (+25%)
Mutual labels:  lsh
Text-Analysis
Explaining textual analysis tools in Python. Including Preprocessing, Skip Gram (word2vec), and Topic Modelling.
Stars: ✭ 48 (+200%)
Mutual labels:  textual-analysis
lshensemble
LSH index for approximate set containment search
Stars: ✭ 48 (+200%)
Mutual labels:  lsh
lsh
Locality Sensitive Hashing for Go (Multi-probe LSH, LSH Forest, basic LSH)
Stars: ✭ 92 (+475%)
Mutual labels:  lsh
bagminhash
BagMinHash - Minwise Hashing Algorithm for Weighted Sets
Stars: ✭ 24 (+50%)
Mutual labels:  jaccard-similarity
product-quantization
🙃Implementation of vector quantization algorithms, codes for Norm-Explicit Quantization: Improving Vector Quantization for Maximum Inner Product Search.
Stars: ✭ 40 (+150%)
Mutual labels:  lsh

Locality Sensitive Hashing for semantic similarity

LSH (Locality Sensitive Hashing) is primarily used to find the near-duplicates in a large set of documents. It can work with Hamming distance, the Jaccard coefficient, edit distance, or another notion of distance.

You can read the following tutorials if you want to understand more about it:

Although LSH is better suited to finding duplicated documents than semantically similar ones, in this approach I make an effort to use LSH to calculate semantic similarity among texts. For that, the algorithm uses TF-IDF to extract each text's main tokens (or you can pre-calculate them and pass them as a parameter). I also use MinHash (which estimates Jaccard similarity) as the similarity function.
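The token-extraction step can be illustrated with a minimal sketch that uses only the standard library. This is not the project's own code: the function names (`tokenize`, `top_tfidf_tokens`), the regex tokenizer, and the `k` parameter are assumptions made for illustration, and a real pipeline would likely add stop-word removal and stemming.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_tfidf_tokens(docs, k=3):
    """Return, for each document, the set of its k highest-scoring TF-IDF tokens.

    Hypothetical helper: TF is the relative term frequency in the document,
    IDF is log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    result = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {
            tok: (count / len(toks)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        }
        # Sort by score (descending), then alphabetically so ties are deterministic.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(set(ranked[:k]))
    return result


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
main_tokens = top_tfidf_tokens(docs, k=3)
```

A token that appears in every document (here, "the") gets an IDF of zero, so it is ranked last. The resulting token sets are what would then be passed to MinHash.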

The overall aim is to reduce the number of comparisons needed to find similar items. LSH uses hash collisions to capture the similarity between objects: similar documents have a high probability of receiving the same hash value. For MinHash, the probability that two sets collide under a single hash function equals their Jaccard similarity.
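The collision idea can be sketched in a few lines of plain Python. This is an illustrative sketch, not the repository's implementation: the helper names, the use of seeded `blake2b` hashes in place of true random permutations, and the values `num_perm=128` and `bands=32` are all assumptions. The signature is split into bands, and two documents become a candidate pair only if they agree on every row of at least one band, so most dissimilar pairs are never compared.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=128):
    """Keep the minimum hash per seeded hash function (tokens must be non-empty)."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def candidate_pairs(signatures, bands=32):
    """Return pairs of document ids whose signatures match on at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


token_sets = {
    "a": {"lsh", "minhash", "jaccard", "hash"},
    "b": {"lsh", "minhash", "jaccard", "hash"},
    "c": {"apple", "banana", "cherry"},
}
signatures = {name: minhash_signature(toks) for name, toks in token_sets.items()}
pairs = candidate_pairs(signatures)
```

With `b` bands of `r` rows each, two sets with Jaccard similarity `s` become candidates with probability `1 - (1 - s^r)^b`, so the choice of `bands` trades recall against the number of comparisons that remain.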

See this tutorial to learn how to use this LSH implementation!

Run the following command to install the dependencies:

  python3 -m pip install -r requirements.txt