Top 41 deduplication open source projects

Data Matching Software
A list of free data matching and record linkage software.
Lsh
Locality-sensitive hashing using MinHash in Python/Cython to detect near-duplicate text documents
Restic
Fast, secure, efficient backup program
Kvdo
A pair of kernel modules which provide pools of deduplicated and/or compressed block storage.
Dupeguru
Find duplicate files
Dejavu
Quickly detect already witnessed data.
Vdo
Userspace tools for managing VDO volumes.
Spark Lucenerdd
Spark RDD with Lucene's query and entity linkage capabilities
Fingerprints
Make it easier to compare and cross-reference the names of companies and people by applying strong normalisation.
Rltk
Record Linkage ToolKit (Find and link entities)
Rmlint
Extremely fast tool to remove duplicates and other lint from your filesystem
Fastcdc Rs
FastCDC implementation in Rust
Dupandas
Python package for deduplication using flexible text matching and cleaning in pandas DataFrames
Borgmatic
Simple, configuration-driven backup software for servers and workstations
Jdupes
A powerful duplicate file finder and an enhanced fork of 'fdupes'.
Rdedup
Data deduplication engine, supporting optional compression and public key encryption.
Talisman
Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
Recordlinkage
A toolkit for record linkage and duplicate detection in Python
Kopia
Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
Libpostal
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
lieu
Dedupe/batch geocode addresses and venues around the world with libpostal
UMICollapse
Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
RocketMQDedupListener
Idempotent, deduplicating message consumer for RocketMQ; supports MySQL or Redis as the idempotency table and works out of the box
record-linkage-resources
Resources for tackling record linkage / deduplication / data matching problems
entity-embed
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
splink
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
dduper
Fast block-level out-of-band BTRFS deduplication tool.
IntraArchiveDeduplicator
Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
cargo-limit
Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.
zpaqfranz
Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
deduplication
Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in Javascript. Ships with extensive tests, a fuzz test and a benchmark.
Frost
A backup program that does deduplication, compression, encryption
dedupsqlfs
Deduplicating filesystem via Python3, FUSE and SQLite
nomenklatura
Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources