jRimbault / yadf

License: MIT
Yet Another Dupes Finder

Programming Languages

rust
11053 projects
python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to yadf

dduper
Fast block-level out-of-band BTRFS deduplication tool.
Stars: ✭ 108 (+237.5%)
Mutual labels:  dedupe, deduplication
Restic
Fast, secure, efficient backup program
Stars: ✭ 15,105 (+47103.13%)
Mutual labels:  dedupe, deduplication
dupe-krill
A fast file deduplicator
Stars: ✭ 147 (+359.38%)
Mutual labels:  dedupe, file-deduplication
mail-deduplicate
📧 CLI to deduplicate mails from mail boxes.
Stars: ✭ 134 (+318.75%)
Mutual labels:  dedupe, deduplication
zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Stars: ✭ 655 (+1946.88%)
Mutual labels:  dedupe, deduplication
Dedupe
🆔 A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
Stars: ✭ 3,241 (+10028.13%)
Mutual labels:  dedupe
zpaqfranz
Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
Stars: ✭ 86 (+168.75%)
Mutual labels:  deduplication
mediadc
Nextcloud Media Duplicate Collector application
Stars: ✭ 57 (+78.13%)
Mutual labels:  duplicate-detection
Lsh
Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
Stars: ✭ 182 (+468.75%)
Mutual labels:  deduplication
bugrepo
A collection of publicly available bug reports
Stars: ✭ 93 (+190.63%)
Mutual labels:  duplicate-detection
django-super-deduper
Utilities for de-duping Django model instances
Stars: ✭ 27 (-15.62%)
Mutual labels:  dedupe
dedupsqlfs
Deduplicating filesystem via Python3, FUSE and SQLite
Stars: ✭ 24 (-25%)
Mutual labels:  deduplication
deduplication
Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in Javascript. Ships with extensive tests, a fuzz test and a benchmark.
Stars: ✭ 59 (+84.38%)
Mutual labels:  deduplication
videohash
Near Duplicate Video Detection (Perceptual Video Hashing) - Get a 64-bit comparable hash-value for any video.
Stars: ✭ 155 (+384.38%)
Mutual labels:  duplicate-detection
nomenklatura
Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources
Stars: ✭ 158 (+393.75%)
Mutual labels:  deduplication
Data Matching Software
A list of free data matching and record linkage software.
Stars: ✭ 206 (+543.75%)
Mutual labels:  deduplication
Frost
A backup program that does deduplication, compression, encryption
Stars: ✭ 25 (-21.87%)
Mutual labels:  deduplication
cargo-limit
Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.
Stars: ✭ 105 (+228.13%)
Mutual labels:  deduplication
duplex
Duplicate code finder for Elixir
Stars: ✭ 20 (-37.5%)
Mutual labels:  dedupe
Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Stars: ✭ 18 (-43.75%)
Mutual labels:  deduplication

YADF — Yet Another Dupes Finder

It's fast on my machine.

Installation

Prebuilt Packages

Executable binaries for some platforms are available in the releases section.

Building from source

  1. Install Rust Toolchain
  2. Run cargo install yadf

Usage

yadf defaults:

  • search current working directory $PWD
  • output format is the same as the "standard" fdupes: newline-separated groups
  • descends automatically into subdirectories
  • the search includes every file (including empty files)

yadf # find duplicate files in current directory
yadf ~/Documents ~/Pictures # find duplicate files in two directories
yadf --depth 0 file1 file2 # compare two files
yadf --depth 1 # find duplicates in current directory without descending
fd --type d a | yadf --depth 1 # find directories with an "a" and search them for duplicates without descending
fd --type f a | yadf # find files with an "a" and check them for duplicates

Filtering

yadf --min 100M # find duplicate files of at least 100 MB
yadf --max 100M # find duplicate files below 100 MB
yadf --pattern '*.jpg' # find duplicate jpg
yadf --regex '^g' # find duplicate starting with 'g'
yadf --rfactor over:10 # find files with more than 10 copies
yadf --rfactor under:10 # find files with less than 10 copies
yadf --rfactor equal:1 # find unique files

Formatting

Look up the help, yadf -h, for the list of output formats.

yadf -f json
yadf -f fdupes
yadf -f csv
yadf -f ldjson

Help output:
yadf 0.13.1
Yet Another Dupes Finder

USAGE:
    yadf [FLAGS] [OPTIONS] [paths]...

FLAGS:
    -H, --hard-links    Treat hard links to same file as duplicates
    -h, --help          Prints help information
    -n, --no-empty      Excludes empty files
    -q, --quiet         Pass many times for less log output
    -V, --version       Prints version information
    -v, --verbose       Pass many times for more log output

OPTIONS:
    -a, --algorithm <algorithm>    Hashing algorithm [default: AHash]  [possible values: AHash,
                                   Highway, MetroHash, SeaHash, XxHash]
    -f, --format <format>          Output format [default: Fdupes]  [possible values: Csv, Fdupes,
                                   Json, JsonPretty, LdJson, Machine]
        --max <size>               Maximum file size
    -d, --depth <depth>            Maximum recursion depth
        --min <size>               Minimum file size
    -p, --pattern <glob>           Check files with a name matching a glob pattern, see:
                                   https://docs.rs/globset/0.4.6/globset/index.html#syntax
    -R, --regex <regex>            Check files with a name matching a Perl-style regex, see:
                                   https://docs.rs/regex/1.4.2/regex/index.html#syntax
        --rfactor <rfactor>        Replication factor [under|equal|over]:n

ARGS:
    <paths>...    Directories to search

For sizes, K/M/G/T[B|iB] suffixes can be used (case-insensitive).
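
As a concrete illustration of that suffix convention, here is a tiny, hypothetical Rust helper (not yadf's actual parser), assuming K/KB mean 10^3 bytes and KiB means 2^10 bytes; yadf's own parsing may differ in the exact variants it accepts.

// Illustrative only: this is not yadf's parser. It assumes the common
// convention that K/KB = 10^3 bytes and KiB = 2^10 bytes.
fn parse_size(input: &str) -> Option<u64> {
    let s = input.trim().to_ascii_lowercase();
    for (prefix, exp) in [("k", 1u32), ("m", 2), ("g", 3), ("t", 4)] {
        for (suffix, base) in [
            (format!("{prefix}ib"), 1024u64), // binary: KiB, MiB, ...
            (format!("{prefix}b"), 1000u64),  // decimal: KB, MB, ...
            (prefix.to_string(), 1000u64),    // bare: K, M, ...
        ] {
            if let Some(number) = s.strip_suffix(&suffix) {
                return Some(number.trim().parse::<u64>().ok()? * base.pow(exp));
            }
        }
    }
    s.parse().ok() // plain byte count, e.g. "4096"
}

fn main() {
    assert_eq!(parse_size("100M"), Some(100_000_000));   // decimal megabytes
    assert_eq!(parse_size("100MiB"), Some(104_857_600)); // binary mebibytes
}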

Notes on the algorithm

Most¹ dupe finders follow a 3-step algorithm:

  1. group files by their size
  2. group files by their first few bytes
  3. group files by their entire content

yadf skips the first step and only performs steps 2 and 3, preferring hashing over byte-by-byte comparison. In my tests, keeping the first step actually slowed the program down on an SSD. yadf makes heavy use of the standard library's BTreeMap; it uses a cache-aware implementation to avoid too many cache misses. yadf uses the parallel walker provided by ignore (disabling its ignore features) and rayon's parallel iterators to run each of these two steps in parallel.

¹: some need a different algorithm to support different features or different performance trade-offs
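
For illustration, here is a minimal, sequential Rust sketch of steps 2 and 3 as described above. It is not yadf's actual implementation: yadf runs both passes in parallel (the ignore walker plus rayon) and lets you choose the hash function, while this sketch substitutes the standard library's DefaultHasher and a fixed 4 KiB prefix.

use std::collections::{hash_map::DefaultHasher, BTreeMap};
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read};
use std::path::PathBuf;

// Hash at most `limit` bytes of a file (or the whole file if `limit` is None).
fn hash_file(path: &PathBuf, limit: Option<usize>) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    let mut file = File::open(path)?;
    let mut buf = [0u8; 4096];
    let mut remaining = limit.unwrap_or(usize::MAX);
    while remaining > 0 {
        let read = file.read(&mut buf)?;
        if read == 0 {
            break;
        }
        let take = read.min(remaining);
        hasher.write(&buf[..take]);
        remaining -= take;
    }
    Ok(hasher.finish())
}

// Step 2: group by a hash of the first 4 KiB; step 3: rehash the full contents
// within each candidate group. Only groups with more than one member survive.
fn find_dupes(paths: Vec<PathBuf>) -> Vec<Vec<PathBuf>> {
    let mut by_prefix: BTreeMap<u64, Vec<PathBuf>> = BTreeMap::new();
    for path in paths {
        if let Ok(hash) = hash_file(&path, Some(4096)) {
            by_prefix.entry(hash).or_default().push(path);
        }
    }
    let mut by_content: BTreeMap<u64, Vec<PathBuf>> = BTreeMap::new();
    for path in by_prefix.into_values().filter(|g| g.len() > 1).flatten() {
        if let Ok(hash) = hash_file(&path, None) {
            by_content.entry(hash).or_default().push(path);
        }
    }
    by_content.into_values().filter(|g| g.len() > 1).collect()
}

fn main() {
    // fdupes-style output: newline-separated groups of duplicate paths.
    let paths = std::env::args().skip(1).map(PathBuf::from).collect();
    for group in find_dupes(paths) {
        for path in &group {
            println!("{}", path.display());
        }
        println!();
    }
}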

Design goals

I set out to build a high-performing tool by assembling libraries that do the actual work; nothing here is custom-made, it's all "off-the-shelf" software.

Benchmarks

The performance of yadf is heavily tied to the hardware, specifically the NVMe SSD. I recommend fclones, as it has more hardware heuristics and, in general, more features. yadf performs terribly on HDDs.

My home directory contains upwards of 700k paths and 39 GB of data, and is probably a pathological case of file duplication with all the node_modules, python virtual environments, rust target, etc. Arguably, the most important measure here is the mean time when the filesystem cache is cold.

Program (warm filesystem cache)   Version   Mean [s]          Min [s]   Max [s]   Relative
fclones                           0.8.0      4.107 ± 0.045     4.065     4.189    1.58 ± 0.04
jdupes                            1.14.0    11.982 ± 0.038    11.924    12.030    4.60 ± 0.11
ddh                               0.11.3    10.602 ± 0.062    10.521    10.678    4.07 ± 0.10
rmlint                            2.9.0     17.640 ± 0.119    17.426    17.833    6.77 ± 0.17
dupe-krill                        1.4.4      9.110 ± 0.040     9.053     9.154    3.50 ± 0.08
fddf                              1.7.0      5.630 ± 0.049     5.562     5.717    2.16 ± 0.05
yadf                              0.14.1     2.605 ± 0.062     2.517     2.676    1.00

Program (cold filesystem cache)   Version   Mean [s]
fclones                           0.8.0      19.452
jdupes                            1.14.0    129.132
ddh                               0.11.3     27.241
rmlint                            2.9.0      67.580
dupe-krill                        1.4.4     127.860
fddf                              1.7.0      32.661
yadf                              0.13.1     21.554

fdupes is excluded from this benchmark because it's really slow.

The script used to benchmark can be read here.

Hardware used.

Extract from neofetch and hwinfo --disk:

  • OS: Ubuntu 20.04.1 LTS x86_64
  • Host: XPS 15 9570
  • Kernel: 5.4.0-42-generic
  • CPU: Intel i9-8950HK (12) @ 4.800GHz
  • Memory: 4217MiB / 31755MiB
  • Disk:
    • model: "SK hynix Disk"
    • driver: "nvme"