zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Stars: ✭ 655 (+1946.88%)
mail-deduplicate
📧 CLI to deduplicate mails from mail boxes.
Stars: ✭ 134 (+318.75%)
Restic
Fast, secure, efficient backup program
Stars: ✭ 15,105 (+47103.13%)
dduper
Fast block-level out-of-band BTRFS deduplication tool.
Stars: ✭ 108 (+237.5%)
dupe-krill
A fast file deduplicator
Stars: ✭ 147 (+359.38%)
cargo-limit
Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.
Stars: ✭ 105 (+228.13%)
bugrepo
A collection of publicly available bug reports
Stars: ✭ 93 (+190.63%)
zpaqfranz
Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
Stars: ✭ 86 (+168.75%)
Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Stars: ✭ 18 (-43.75%)
deduplication
Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in JavaScript. Ships with extensive tests, a fuzz test and a benchmark.
Stars: ✭ 59 (+84.38%)
mediadc
Nextcloud Media Duplicate Collector application
Stars: ✭ 57 (+78.13%)
Frost
A backup program that does deduplication, compression, encryption
Stars: ✭ 25 (-21.87%)
dedupsqlfs
Deduplicating filesystem via Python3, FUSE and SQLite
Stars: ✭ 24 (-25%)
videohash
Near Duplicate Video Detection (Perceptual Video Hashing) - Get a 64-bit comparable hash-value for any video.
Stars: ✭ 155 (+384.38%)
nomenklatura
Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources
Stars: ✭ 158 (+393.75%)
duplex
Duplicate code finder for Elixir
Stars: ✭ 20 (-37.5%)
removedupes
Remove Duplicate Messages
Stars: ✭ 52 (+62.5%)
Dedupe
🆔 A Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
Stars: ✭ 3,241 (+10028.13%)
Lsh
Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
Stars: ✭ 182 (+468.75%)
Kvdo
A pair of kernel modules which provide pools of deduplicated and/or compressed block storage.
Stars: ✭ 168 (+425%)
Dupeguru
Find duplicate files
Stars: ✭ 2,385 (+7353.13%)
Dejavu
Quickly detect already witnessed data.
Stars: ✭ 151 (+371.88%)
Vdo
Userspace tools for managing VDO volumes.
Stars: ✭ 138 (+331.25%)
Spark Lucenerdd
Spark RDD with Lucene's query and entity linkage capabilities
Stars: ✭ 114 (+256.25%)
Fingerprints
Make it easier to compare and cross-reference the names of companies and people by applying strong normalisation.
Stars: ✭ 91 (+184.38%)
Rltk
Record Linkage ToolKit (Find and link entities)
Stars: ✭ 71 (+121.88%)
Rmlint
Extremely fast tool to remove duplicates and other lint from your filesystem
Stars: ✭ 996 (+3012.5%)
Fastcdc Rs
FastCDC implementation in Rust
Stars: ✭ 31 (-3.12%)
Dupandas
📊 Python package for performing deduplication using flexible text matching and cleaning in pandas dataframe
Stars: ✭ 20 (-37.5%)
Borgmatic
Simple, configuration-driven backup software for servers and workstations
Stars: ✭ 902 (+2718.75%)
Jdupes
A powerful duplicate file finder and an enhanced fork of 'fdupes'.
Stars: ✭ 790 (+2368.75%)
Rdedup
Data deduplication engine, supporting optional compression and public key encryption.
Stars: ✭ 690 (+2056.25%)
Talisman
Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
Stars: ✭ 584 (+1725%)
Recordlinkage
A toolkit for record linkage and duplicate detection in Python
Stars: ✭ 532 (+1562.5%)
Kopia
Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
Stars: ✭ 507 (+1484.38%)
Alertmanager
Prometheus Alertmanager
Stars: ✭ 4,574 (+14193.75%)
Libpostal
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Stars: ✭ 3,312 (+10250%)
lieu
Dedupe/batch geocode addresses and venues around the world with libpostal
Stars: ✭ 73 (+128.13%)
UMICollapse
Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
Stars: ✭ 31 (-3.12%)
record-linkage-resources
Resources for tackling record linkage / deduplication / data matching problems
Stars: ✭ 67 (+109.38%)
gencore
Generate duplex/single consensus reads to reduce sequencing noise and remove duplications
Stars: ✭ 91 (+184.38%)
entity-embed
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Stars: ✭ 96 (+200%)
splink
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
Stars: ✭ 181 (+465.63%)
acid-store
A library for secure, deduplicated, transactional, and verifiable data storage
Stars: ✭ 48 (+50%)
IntraArchiveDeduplicator
Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
Stars: ✭ 87 (+171.88%)
OpenStaticAnalyzer
OpenStaticAnalyzer is a source code analyzer tool, which can perform deep static analysis of the source code of complex systems.
Stars: ✭ 19 (-40.62%)
apollo
Advanced similarity and duplicate source code proof of concept for our research efforts.
Stars: ✭ 49 (+53.13%)
snowman
Welcome to Snowman App, a Data Matching Benchmark Platform.
Stars: ✭ 25 (-21.87%)
twly
Wanna get DRY? Static analysis tool for detecting repeat code.
Stars: ✭ 42 (+31.25%)