fake-name / IntraArchiveDeduplicator

Licence: BSD-3-Clause

Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.

Programming Languages

Python
CSS
C++
HTML
Shell

Projects that are alternatives of or similar to IntraArchiveDeduplicator

pqlite

⚡ A fast embedded library for approximate nearest neighbor search

Stars: ✭ 141 (+62.07%)

Mutual labels: image-search

web-image-crawler

Code to download web-images

Stars: ✭ 15 (-82.76%)

Mutual labels: image-search

yadf

Yet Another Dupes Finder

Stars: ✭ 32 (-63.22%)

Mutual labels: deduplication

dedupsqlfs

Deduplicating filesystem via Python3, FUSE and SQLite

Stars: ✭ 24 (-72.41%)

Mutual labels: deduplication

Neural-Scam-Artist

Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

Stars: ✭ 18 (-79.31%)

Mutual labels: deduplication

img classification deep learning

No description or website provided.

Stars: ✭ 19 (-78.16%)

Mutual labels: image-search

Fergun

An utility Discord bot written in C# using Discord.Net

Stars: ✭ 26 (-70.11%)

Mutual labels: image-search

fuzzysearch

A site that allows you to reverse image search millions of furry images in under a second

Stars: ✭ 34 (-60.92%)

Mutual labels: image-search

mail-deduplicate

📧 CLI to deduplicate mails from mail boxes.

Stars: ✭ 134 (+54.02%)

Mutual labels: deduplication

cargo-limit

Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.

Stars: ✭ 105 (+20.69%)

Mutual labels: deduplication

Frost

A backup program that does deduplication, compression, encryption

Stars: ✭ 25 (-71.26%)

Mutual labels: deduplication

deduplication

Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in Javascript. Ships with extensive tests, a fuzz test and a benchmark.

Stars: ✭ 59 (-32.18%)

Mutual labels: deduplication

zpaqfranz

Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix

Stars: ✭ 86 (-1.15%)

Mutual labels: deduplication

SmartImage

Reverse image search tool (SauceNao, ImgOps, trace.moe, and more)

Stars: ✭ 346 (+297.7%)

Mutual labels: image-search

pupyl

🧿 Pupyl is a really fast image search library which you can index your own (millions of) images and find similar images in milliseconds.

Stars: ✭ 83 (-4.6%)

Mutual labels: image-search

nomenklatura

Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources

Stars: ✭ 158 (+81.61%)

Mutual labels: deduplication

weapp-saucenao

微信小程序: 识图娘

Stars: ✭ 19 (-78.16%)

Mutual labels: image-search

iqdb tagger

Search IQDB from CLI

Stars: ✭ 18 (-79.31%)

Mutual labels: image-search

MoTIS

Mobile(iOS) Text-to-Image search powered by multimodal semantic representation models(e.g., OpenAI's CLIP). Accepted at NAACL 2022.

Stars: ✭ 60 (-31.03%)

Mutual labels: image-search

zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

Stars: ✭ 655 (+652.87%)

Mutual labels: deduplication


IntraArchiveDeduplicator

Tool for managing data-deduplication within extant compressed archive files, with a heavy focus on Manga/Comic-book archive files.

This is a rather exotic tool that is intended to allow fairly fast duplicate detection for files within compressed archives.

It maintains a database of hashes for all files it scans, and actually recurses into compressed archives to scan the files within the archives, which should allow detection of archives with duplicate contents, even if the archives are compressed using different compression algorithms.
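To illustrate why hashing the files inside an archive makes the result independent of the compression algorithm, here is a minimal sketch. It is not the tool's actual code: the choice of SHA-256, the zip-only handling, and the names hash_archive_members and build_zip are all assumptions made for the example, and it keeps the hashes in a dict rather than a database.

```python
import hashlib
import io
import zipfile


def hash_archive_members(archive_bytes):
    """Return {member_name: sha256 hex digest} for the files in a zip archive.

    The digest is computed over each member's decompressed bytes, so it does
    not depend on how the archive itself was compressed.
    """
    hashes = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            hashes[info.filename] = hashlib.sha256(archive.read(info)).hexdigest()
    return hashes


def build_zip(files, compression):
    """Build an in-memory zip from {name: bytes} using the given compression."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=compression) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


# Hypothetical page data standing in for real image files.
pages = {"001.png": b"page one" * 100, "002.png": b"page two" * 100}
stored = build_zip(pages, zipfile.ZIP_STORED)
deflated = build_zip(pages, zipfile.ZIP_DEFLATED)

# The archives differ byte-for-byte, but their member hashes match.
same_contents = hash_archive_members(stored) == hash_archive_members(deflated)
```

Comparing the two dictionaries (or just their sets of digests, if filenames may differ) is enough to flag the archives as having duplicate contents.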

There are also facilities for searching by image similarity, using a custom tree system.

The image similarity system runs as a server on top of an existing PostgreSQL server, and is implemented in Python (actually Cython, but basically Python). It's fairly memory hungry: currently, ~12M hashes take about 5 GB of RAM, or roughly 400 bytes per hash. There is some room for optimization here.

Theoretically, each hash should take 64*8 + 8 + (8 * number of IDs at each node) bytes, plus a few bytes of housekeeping. However, right now, a number of the node attributes (the child links, for example) are stored as hashtables, so they do not take as much space as they theoretically would if every child pointer pointed to an actual valid node.
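The arithmetic behind that formula can be worked through as a quick sketch. The reading of each term (64 child-pointer slots of 8 bytes, one 8-byte hash, 8 bytes per ID) is an assumption about what the numbers stand for, and the constant and function names are made up for the example.

```python
POINTER_BYTES = 8   # assumed: one 64-bit pointer per possible child link
CHILD_SLOTS = 64    # assumed meaning of the "64" in the 64*8 term
HASH_BYTES = 8      # assumed: one 64-bit hash value per node
ID_BYTES = 8        # assumed: one 64-bit ID per item stored at the node


def theoretical_node_bytes(ids_at_node):
    """Per-node size from the formula above, ignoring housekeeping overhead."""
    return CHILD_SLOTS * POINTER_BYTES + HASH_BYTES + ID_BYTES * ids_at_node


single_id_node = theoretical_node_bytes(1)   # 512 + 8 + 8 = 528 bytes

# Observed figure: ~12M hashes in about 5 GB of RAM.
observed_bytes_per_hash = 5e9 / 12e6         # roughly 417 bytes
```

With one ID per node the formula gives 528 bytes, while ~12M hashes in about 5 GB works out to roughly 417 bytes per hash, which fits the point that sparsely populated child tables currently use less than the fully populated theoretical layout.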

Right now, the scanning and DB maintenance code is largely functional, but the logic to actually do deduplication is very preliminary. My MangaCMS project already has some support for detecting when a newly downloaded file has entirely duplicated content, and for automatically removing the new file to prevent further introduction of duplicates.

Dependencies:

PostgreSQL >= 9.3 (Possibly any >9.0?)
Psycopg2
Cython
RPyC
Colorama
python-sql

For PHashing:

Numpy
Scipy
Pillow

For Unit testing:

Coverage.py
Bitstring

There are fairly extensive unit tests for the DB API, as well as the BK-tree and the phashing systems. However, the great majority of the tests (all the DB API tests, which are 80%+ of them) require a local PostgreSQL instance, so they're not suitable for CI.

BK-Tree

The BK-tree implementation has been broken out into an independent library installable via pip. It's hosted here. Many thanks to user @gpip for doing the legwork making it portable.
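For readers unfamiliar with the structure, a BK-tree over integer hashes with Hamming distance can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not the API of the library mentioned above or of the Cython/C++ code in this repository; the names BKNode, BKTree, insert, and search are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


class BKNode:
    def __init__(self, value, item_id):
        self.value = value
        self.ids = {item_id}  # several items may share the same hash
        self.children = {}    # distance from this node -> child BKNode


class BKTree:
    def __init__(self):
        self.root = None

    def insert(self, value, item_id):
        if self.root is None:
            self.root = BKNode(value, item_id)
            return
        node = self.root
        while True:
            dist = hamming(node.value, value)
            if dist == 0:
                node.ids.add(item_id)
                return
            child = node.children.get(dist)
            if child is None:
                node.children[dist] = BKNode(value, item_id)
                return
            node = child

    def search(self, value, max_dist):
        """Return the IDs of all stored hashes within max_dist of value."""
        found = set()
        stack = [self.root] if self.root is not None else []
        while stack:
            node = stack.pop()
            dist = hamming(node.value, value)
            if dist <= max_dist:
                found |= node.ids
            # Triangle inequality: only children whose edge distance lies in
            # [dist - max_dist, dist + max_dist] can contain matches.
            for edge, child in node.children.items():
                if dist - max_dist <= edge <= dist + max_dist:
                    stack.append(child)
        return found


tree = BKTree()
tree.insert(0b0000, 1)
tree.insert(0b0001, 2)
tree.insert(0b0111, 3)
tree.insert(0b1111, 4)
near = tree.search(0b0000, 1)  # IDs of hashes within 1 bit of 0b0000
```

Storing the children in a dict keyed by distance mirrors the hashtable-based child links described earlier: a node only pays for the child slots it actually uses.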

TODO:

Moved counter in CPPBKTree delete operation doesn't work.
Long filenames break hasher? Bad filename: "00PGw6sr1r7fBlHub52mIRCQ4Nd5jTn0n31_2B3HvgCJTHDJcBK3qV0H7k7gdTwYTiaowENq0D8vK0hBDL_5d88vcInWqPRs4H8GZQYHRlzrHWUYNKiD0QRoeEOz2AztX4nF8v0=w1600"
