
fake-name / IntraArchiveDeduplicator

Licence: BSD-3-Clause
Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.

Programming Languages

python
139335 projects - #7 most used programming language
CSS
56736 projects
C++
36643 projects - #6 most used programming language
HTML
75241 projects
shell
77523 projects

Projects that are alternatives of or similar to IntraArchiveDeduplicator

pqlite
⚡ A fast embedded library for approximate nearest neighbor search
Stars: ✭ 141 (+62.07%)
Mutual labels:  image-search
web-image-crawler
Code to download web-images
Stars: ✭ 15 (-82.76%)
Mutual labels:  image-search
yadf
Yet Another Dupes Finder
Stars: ✭ 32 (-63.22%)
Mutual labels:  deduplication
dedupsqlfs
Deduplicating filesystem via Python3, FUSE and SQLite
Stars: ✭ 24 (-72.41%)
Mutual labels:  deduplication
Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Stars: ✭ 18 (-79.31%)
Mutual labels:  deduplication
img classification deep learning
No description or website provided.
Stars: ✭ 19 (-78.16%)
Mutual labels:  image-search
Fergun
A utility Discord bot written in C# using Discord.Net
Stars: ✭ 26 (-70.11%)
Mutual labels:  image-search
fuzzysearch
A site that allows you to reverse image search millions of furry images in under a second
Stars: ✭ 34 (-60.92%)
Mutual labels:  image-search
mail-deduplicate
📧 CLI to deduplicate mails from mail boxes.
Stars: ✭ 134 (+54.02%)
Mutual labels:  deduplication
cargo-limit
Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.
Stars: ✭ 105 (+20.69%)
Mutual labels:  deduplication
Frost
A backup program that does deduplication, compression, encryption
Stars: ✭ 25 (-71.26%)
Mutual labels:  deduplication
deduplication
Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in Javascript. Ships with extensive tests, a fuzz test and a benchmark.
Stars: ✭ 59 (-32.18%)
Mutual labels:  deduplication
zpaqfranz
Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
Stars: ✭ 86 (-1.15%)
Mutual labels:  deduplication
SmartImage
Reverse image search tool (SauceNao, ImgOps, trace.moe, and more)
Stars: ✭ 346 (+297.7%)
Mutual labels:  image-search
pupyl
🧿 Pupyl is a really fast image search library which you can index your own (millions of) images and find similar images in milliseconds.
Stars: ✭ 83 (-4.6%)
Mutual labels:  image-search
nomenklatura
Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources
Stars: ✭ 158 (+81.61%)
Mutual labels:  deduplication
weapp-saucenao
WeChat mini program: 识图娘 (image identification)
Stars: ✭ 19 (-78.16%)
Mutual labels:  image-search
iqdb tagger
Search IQDB from CLI
Stars: ✭ 18 (-79.31%)
Mutual labels:  image-search
MoTIS
Mobile(iOS) Text-to-Image search powered by multimodal semantic representation models(e.g., OpenAI's CLIP). Accepted at NAACL 2022.
Stars: ✭ 60 (-31.03%)
Mutual labels:  image-search
zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Stars: ✭ 655 (+652.87%)
Mutual labels:  deduplication

IntraArchiveDeduplicator

Tool for managing data-deduplication within extant compressed archive files, with a heavy focus on Manga/Comic-book archive files.

This is a rather exotic tool that is intended to allow fairly fast duplicate detection for files within compressed archives.

It maintains a database of hashes for all files it scans, and actually recurses into compressed archives to scan the files within the archives, which should allow detection of archives with duplicate contents, even if the archives are compressed using different compression algorithms.
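The idea of hashing the files inside each archive, rather than the archive itself, can be sketched as follows. This is a minimal illustration and not the project's code: it assumes ZIP archives and SHA-256 from the standard library, and the names `hash_archive_contents`, `archive_fingerprint`, and `build_zip` are hypothetical. The project's actual hash function and database schema are not shown in this README.

```python
import hashlib
import io
import zipfile


def hash_archive_contents(archive_bytes):
    """Return {member name: hex digest} for every file inside a ZIP archive.

    The digest is taken over the decompressed bytes, so the compression
    method used to build the archive does not affect the result.
    """
    hashes = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            hashes[info.filename] = hashlib.sha256(zf.read(info)).hexdigest()
    return hashes


def archive_fingerprint(archive_bytes):
    """Name- and order-independent fingerprint: the set of content hashes."""
    return frozenset(hash_archive_contents(archive_bytes).values())


def build_zip(files, compression):
    """Demo helper (hypothetical): build an in-memory ZIP from {name: bytes}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


pages = {"001.png": b"page one" * 100, "002.png": b"page two" * 100}
stored = build_zip(pages, zipfile.ZIP_STORED)
deflated = build_zip(pages, zipfile.ZIP_DEFLATED)

# The raw archive bytes differ, but the content fingerprints match.
same_contents = archive_fingerprint(stored) == archive_fingerprint(deflated)
```

Comparing fingerprints rather than raw archive bytes is what lets two archives with identical contents be matched even when they were packed with different settings.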

There are also facilities for searching by image similarity, using a custom tree system.

The image similarity system runs as a server on top of an existing PostgreSQL server. It is implemented in Python (actually Cython, but basically Python) and is fairly memory hungry. Currently, ~12M hashes take about 5 GB of RAM, or roughly 400 bytes per hash. There is some room for optimization here.

Theoretically, each hash should take 64*8 + 8 + (8 * number of IDs at each node) bytes, plus a few bytes of housekeeping. However, right now, a number of the node attributes (the child links, for example) are stored as hashtables, so they take less space than they theoretically would if every child pointer pointed to an actual valid node.
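As a worked example of the formula above, the following sketch evaluates it for a node holding a single ID and scales it to ~12M hashes. The reading of the terms (64 child slots of 8 bytes each, one 8-byte hash value, 8 bytes per ID) is an assumption about what the formula means, and the constant and function names are hypothetical.

```python
POINTER_BYTES = 8   # assumed size of one child pointer
CHILD_SLOTS = 64    # assumed: one slot per possible child link
HASH_BYTES = 8      # assumed: one 64-bit hash value
ID_BYTES = 8        # assumed size of one stored ID


def theoretical_node_bytes(ids_at_node):
    """Evaluate 64*8 + 8 + (8 * number of IDs), ignoring housekeeping."""
    return CHILD_SLOTS * POINTER_BYTES + HASH_BYTES + ID_BYTES * ids_at_node


per_node = theoretical_node_bytes(1)        # 528 bytes
total_gb = 12_000_000 * per_node / 1e9      # 6.336 GB for 12M single-ID nodes
```

Under these assumptions a fully populated layout would need more memory than a sparse one, which fits the observation that hashtable-backed child links currently use less space than the theoretical figure.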

Right now, the scanning and DB maintenance code is largely functional, but the logic that actually performs deduplication is very preliminary. My MangaCMS project already has some support for detecting when a newly downloaded file has entirely duplicated content, and for automatically removing the new file to prevent further duplicates from being introduced.
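The "entirely duplicated content" check can be pictured as a subset test on content hashes. The sketch below is only an illustration under that assumption; `is_entirely_duplicated` is a hypothetical name, and it does not reproduce the logic in this project or in MangaCMS.

```python
def is_entirely_duplicated(new_hashes, known_hashes):
    """True when every file hash in the new archive is already known.

    An empty archive is deliberately not treated as a duplicate.
    """
    new_hashes = set(new_hashes)
    return bool(new_hashes) and new_hashes <= set(known_hashes)


known = {"aa11", "bb22", "cc33"}  # placeholder hashes already in the database

print(is_entirely_duplicated({"aa11", "cc33"}, known))  # True
print(is_entirely_duplicated({"aa11", "dd44"}, known))  # False
```

Only when the check returns true would removing the new file be safe; an archive with even one unseen file still carries new content.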

Dependencies:

  • PostgreSQL >= 9.3 (Possibly any >9.0?)
  • Psycopg2
  • Cython
  • RPyC
  • Colorama
  • python-sql

For PHashing:

  • Numpy
  • Scipy
  • Pillow

For Unit testing:

  • Coverage.py
  • Bitstring

There are fairly extensive unit tests for the DB API, as well as the BK-tree and the phashing systems. However, the great majority of the tests (all the DB API tests, which are 80%+ of them) require a local PostgreSQL instance, so they're not suitable for CI.

BK-Tree

The BK-tree implementation has been broken out into an independent library installable via pip. It's hosted here. Many thanks to user @gpip for doing the legwork making it portable.
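For readers unfamiliar with the data structure, here is a minimal pure-Python BK-tree over integer hashes using Hamming distance. It is a sketch of the general technique only: the class and method names are hypothetical, and it is not the API of the pip-installable library or of the Cython implementation in this repository.

```python
class BKTree:
    """Minimal BK-tree over integer hashes using Hamming distance."""

    def __init__(self):
        self.root = None  # each node is [hash_value, {distance: child_node}]

    @staticmethod
    def hamming(a, b):
        """Number of differing bits between two integers."""
        return bin(a ^ b).count("1")

    def insert(self, value):
        if self.root is None:
            self.root = [value, {}]
            return
        node = self.root
        while True:
            dist = self.hamming(value, node[0])
            if dist == 0:
                return  # value already present
            child = node[1].get(dist)
            if child is None:
                node[1][dist] = [value, {}]
                return
            node = child

    def search(self, query, max_dist):
        """Return all stored values within max_dist bits of query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            dist = self.hamming(query, value)
            if dist <= max_dist:
                results.append(value)
            # Triangle inequality: only children whose edge distance lies in
            # [dist - max_dist, dist + max_dist] can contain matches.
            for edge, child in children.items():
                if dist - max_dist <= edge <= dist + max_dist:
                    stack.append(child)
        return results


tree = BKTree()
for h in (0b0000, 0b0001, 0b0011, 0b1111):
    tree.insert(h)

matches = sorted(tree.search(0b0000, 1))  # [0, 1]
```

Child links are kept in a dictionary keyed by distance, so only links that actually exist consume memory. The pruning step in `search` is what makes fuzzy lookups faster than comparing the query against every stored hash.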


TODO:

  • Moved counter in CPPBKTree delete operation doesn't work.
  • Long filenames break hasher? Bad filename: "00PGw6sr1r7fBlHub52mIRCQ4Nd5jTn0n31_2B3HvgCJTHDJcBK3qV0H7k7gdTwYTiaowENq0D8vK0hBDL_5d88vcInWqPRs4H8GZQYHRlzrHWUYNKiD0QRoeEOz2AztX4nF8v0=w1600"
