All Projects → steven-s → text-shingles

steven-s / text-shingles

Licence: MIT license
k-shingling for text to help compare similarity

Programming Languages

python
139335 projects - #7 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to text-shingles

rkmh
Classify sequencing reads using MinHash.
Stars: ✭ 42 (+180%)
Mutual labels:  minhash
Sampled-MinHashing
A method to mine beyond-pairwise relationships using Min-Hashing for large-scale pattern discovery
Stars: ✭ 24 (+60%)
Mutual labels:  minhash
mkmh
Generate kmers/minimizers/hashes/MinHash signatures, including with multiple kmer sizes.
Stars: ✭ 21 (+40%)
Mutual labels:  minhash
HyperMinHash-java
Union, intersection, and set cardinality in loglog space
Stars: ✭ 48 (+220%)
Mutual labels:  minhash
intertext
Detect and visualize text reuse
Stars: ✭ 97 (+546.67%)
Mutual labels:  minhash
Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Stars: ✭ 18 (+20%)
Mutual labels:  minhash
Datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble
Stars: ✭ 1,635 (+10800%)
Mutual labels:  minhash
set-sketch-paper
SetSketch: Filling the Gap between MinHash and HyperLogLog
Stars: ✭ 23 (+53.33%)
Mutual labels:  minhash
bagminhash
BagMinHash - Minwise Hashing Algorithm for Weighted Sets
Stars: ✭ 24 (+60%)
Mutual labels:  minhash
catch-me-if-you-can
plagiarism detector
Stars: ✭ 16 (+6.67%)
Mutual labels:  minhash
minhash-lsh
Minhash LSH in Golang
Stars: ✭ 20 (+33.33%)
Mutual labels:  minhash

Text Shingles Library

This is python 3 library to support measuring the similarity of pieces of text based on their MinHash signature generated from their k-shingle form.

API

Text can be represented in MinHash form by creating a new ShingledText instance and passing in text as well as optional values for the random_seed for hashing (default 5), the shingle_length aka the k in k-shingles (default 5), and the minhash_size for the size of the MinHash signature (default 200). Variables for the list form of the minhash and iterator representation of shingles are available for the object. A similarity function is also available to compute the Jaccard similarity of the two MinHash objects.

Requirements

This library utilizes Python 3, NLTK, and Murmur Hash

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].