set-sketch-paper
SetSketch: Filling the Gap between MinHash and HyperLogLog
Stars: ✭ 23 (-4.17%)
Datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble
Stars: ✭ 1,635 (+6712.5%)
mkmh
Generate kmers/minimizers/hashes/MinHash signatures, including with multiple kmer sizes.
Stars: ✭ 21 (-12.5%)
spark-stringmetric
Spark functions to run popular phonetic and string matching algorithms
Stars: ✭ 51 (+112.5%)
text-shingles
k-shingling for text to help compare similarity
Stars: ✭ 15 (-37.5%)
rkmh
Classify sequencing reads using MinHash.
Stars: ✭ 42 (+75%)
tika-similarity
Tika-Similarity uses the Tika-Python package (a Python port of Apache Tika) to compute file similarity based on metadata features.
Stars: ✭ 92 (+283.33%)
Sampled-MinHashing
A method to mine beyond-pairwise relationships using Min-Hashing for large-scale pattern discovery
Stars: ✭ 24 (+0%)
learning2hash.github.io
Website for "A survey of learning to hash for Computer Vision" https://learning2hash.github.io
Stars: ✭ 14 (-41.67%)
image-ndd-lsh
Near-duplicate image detection using Locality Sensitive Hashing
Stars: ✭ 42 (+75%)
strutil
Golang metrics for calculating string similarity and other string utility functions
Stars: ✭ 114 (+375%)
tlsh
TLSH lib in Golang
Stars: ✭ 110 (+358.33%)
stringdistance
A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more.
Stars: ✭ 60 (+150%)
HyperMinHash-java
Union, intersection, and set cardinality in loglog space
Stars: ✭ 48 (+100%)
Text-Similarity
Text similarity computation using MinHashing and Jaccard distance on the Reuters dataset
Stars: ✭ 15 (-37.5%)
intertext
Detect and visualize text reuse
Stars: ✭ 97 (+304.17%)
Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Stars: ✭ 18 (-25%)
ExpressionMatrix2
Software for exploration of gene expression data from single-cell RNA sequencing.
Stars: ✭ 29 (+20.83%)
Annoy
Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
Stars: ✭ 9,262 (+38491.67%)