Lsh: Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
Restic: Fast, secure, efficient backup program
Kvdo: A pair of kernel modules which provide pools of deduplicated and/or compressed block storage.
Dejavu: Quickly detect already witnessed data.
Vdo: Userspace tools for managing VDO volumes.
Spark Lucenerdd: Spark RDD with Lucene's query and entity linkage capabilities
Fingerprints: Make it easier to compare and cross-reference the names of companies and people by applying strong normalisation.
Rltk: Record Linkage ToolKit (Find and link entities)
Rmlint: Extremely fast tool to remove duplicates and other lint from your filesystem
Dupandas: 📊 python package for performing deduplication using flexible text matching and cleaning in pandas dataframe
Borgmatic: Simple, configuration-driven backup software for servers and workstations
Jdupes: A powerful duplicate file finder and an enhanced fork of 'fdupes'.
Rdedup: Data deduplication engine, supporting optional compression and public key encryption.
Talisman: Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
Recordlinkage: A toolkit for record linkage and duplicate detection in Python
Kopia: Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
Libpostal: A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
lieu: Dedupe/batch geocode addresses and venues around the world with libpostal
UMICollapse: Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
gencore: Generate duplex/single consensus reads to reduce sequencing noise and remove duplicates
entity-embed: PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
splink: Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
dduper: Fast block-level out-of-band BTRFS deduplication tool.
acid-store: A library for secure, deduplicated, transactional, and verifiable data storage
IntraArchiveDeduplicator: Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
yadf: Yet Another Dupes Finder
cargo-limit: Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.
zingg: Scalable identity resolution, entity resolution, data mastering and deduplication using ML
zpaqfranz: Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
Neural-Scam-Artist: Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
deduplication: Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in Javascript. Ships with extensive tests, a fuzz test and a benchmark.
Frost: A backup program that does deduplication, compression, encryption
dedupsqlfs: Deduplicating filesystem via Python3, FUSE and SQLite
nomenklatura: Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources