mattilyra / Lsh

Licence: MIT
Locality Sensitive Hashing using MinHash in Python/Cython to detect near-duplicate text documents

pylsh

pylsh is a Python implementation of locality sensitive hashing (LSH) with MinHash. It is useful for detecting near-duplicate documents.

The implementation uses the MurmurHash v3 library to create document fingerprints.
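
To illustrate what a MinHash fingerprint is, here is a minimal pure-Python sketch. It is not the package's API: the function names, the shingle length of 5 and the 100 seeds are assumptions chosen for illustration, and `hashlib.blake2b` stands in for MurmurHash3 so that the example needs nothing beyond NumPy.

```python
import hashlib

import numpy as np


def shingles(text, n=5):
    """Return the set of character n-grams (shingles) of `text`."""
    text = " ".join(text.split())  # collapse runs of whitespace
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def hash_shingle(shingle, seed):
    """64-bit hash of one shingle under one seed (stand-in for MurmurHash3)."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def fingerprint(text, num_seeds=100, n=5):
    """MinHash fingerprint: for each seed, the minimum hash over all shingles."""
    grams = shingles(text, n)
    return np.array(
        [min(hash_shingle(g, seed) for g in grams) for seed in range(num_seeds)],
        dtype=np.uint64,
    )


def estimated_jaccard(fp_a, fp_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return float(np.mean(fp_a == fp_b))
```

Two documents that share most of their shingles agree in most fingerprint positions, so comparing two short fingerprints approximates comparing the full shingle sets.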

Cython is needed only if you want to regenerate the .cpp files for the hashing and shingling code. By default, the setup script uses the pregenerated .cpp sources; you can change this with the USE_CYTHON flag in setup.py.

NumPy is needed to run the code.

The MurmurHash3 library is distributed under the MIT license. More information: https://github.com/aappleby/smhasher

examples

For an overview of how LSH works and how to set the parameters see this notebook. The notebook is also available in the examples directory.
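
As a rough guide to parameter choice, the sketch below assumes the standard LSH banding scheme, in which a fingerprint is split into `bands` groups of `rows` hashes each and two documents become candidates if any one band matches exactly. The function names are hypothetical and not part of this package.

```python
def candidate_probability(similarity, bands, rows):
    """Probability that two documents with the given Jaccard similarity
    share at least one identical band: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def approximate_threshold(bands, rows):
    """Similarity at which the S-curve rises most steeply, roughly (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


if __name__ == "__main__":
    # Compare a few ways of splitting a 100-hash fingerprint into bands.
    for bands in (10, 20, 50):
        rows = 100 // bands
        print(bands, rows, round(approximate_threshold(bands, rows), 3))
```

With the fingerprint length fixed, more bands (and so fewer rows per band) lower the similarity threshold: more true near duplicates are caught, at the cost of more false candidates.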

installation

> git clone https://github.com/mattilyra/LSH
> cd LSH
> python setup.py install