knjcode / Imgdupes

Finding and deleting near-duplicate images based on perceptual hash.

imgdupes

imgdupes is a command line tool for finding and deleting near-duplicate images in a target directory, based on perceptual hashes.

[Demo video capture] The images are from the Caltech 101 dataset, semi-deduplicated for demonstration.

It is better to remove identical images with fdupes or jdupes first.
Then you can check and delete near-duplicate images with imgdupes, which works much like the fdupes command.

For large datasets

The dedupe process can be sped up with approximate nearest neighbor search over Hamming distance using NGT or hnsw. See the "Against large dataset" section for details.

Install

To install, simply use pip:

$ pip install imgdupes

Usage

The following sample command finds sets of near-duplicate images in the target directory whose phash Hamming distance is 4 or less.
To search images recursively from the target directory, add the -r or --recursive option.

$ imgdupes --recursive target_dir phash 4
target_dir/airplane_0583.jpg
target_dir/airplane_0800.jpg

target_dir/watch_0122.jpg
target_dir/watch_0121.jpg
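
The number passed after the hash name is a Hamming distance threshold: the number of bit positions in which two hashes differ. A minimal sketch of that comparison, assuming 64-bit hashes held as Python integers and an inclusive threshold (the `phash 0` examples in this README only make sense if a distance of 0 counts as a match). The hash values and the `is_near_duplicate` helper are made up for illustration and are not part of imgdupes.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(hash_a: int, hash_b: int, threshold: int) -> bool:
    """Hypothetical helper: treat the threshold as inclusive (an assumption)."""
    return hamming_distance(hash_a, hash_b) <= threshold


# Two made-up 64-bit hashes that differ in exactly three bits.
h1 = 0xF0E1D2C3B4A59687
h2 = h1 ^ 0b1011

print(hamming_distance(h1, h2))      # 3
print(is_near_duplicate(h1, h2, 4))  # True
print(is_near_duplicate(h1, h2, 0))  # False
print(is_near_duplicate(h1, h1, 0))  # True
```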

By default, imgdupes displays a list of duplicate images and exits.
To be prompted for which images to preserve or delete, use the -d or --delete option.

If you are using iTerm2, you can display each set of images on the terminal with the -c or --imgcat option.

$ imgdupes --recursive --delete --imgcat 101_ObjectCategories phash 4

Each set of images is sorted by file size (see the --sort option) and displayed together with each image's pixel size, so you can choose which image to preserve.

With the -N or --noprompt option, you can preserve the first file in each set of duplicates and delete the rest without prompting.

$ imgdupes -rdN 101_ObjectCategories phash 0
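
The "preserve the first, delete the rest" rule can be sketched as follows. This is only an illustration of the selection logic described above; the function name `split_keep_and_delete` is hypothetical, and the sketch deliberately does not touch the file system.

```python
from typing import List, Tuple


def split_keep_and_delete(groups: List[List[str]]) -> Tuple[List[str], List[str]]:
    """For each duplicate set, keep the first path and mark the rest for deletion."""
    keep: List[str] = []
    delete: List[str] = []
    for group in groups:
        if not group:
            continue  # nothing to decide for an empty set
        keep.append(group[0])
        delete.extend(group[1:])
    return keep, delete


# Duplicate sets in the order they would be listed (sample paths from above).
groups = [
    ["target_dir/airplane_0583.jpg", "target_dir/airplane_0800.jpg"],
    ["target_dir/watch_0122.jpg", "target_dir/watch_0121.jpg"],
]
keep, delete = split_keep_and_delete(groups)
print(keep)    # first file of each set
print(delete)  # everything else
```

Because the first file is the one preserved, the sort order of each set decides which image survives.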

To take input from a list of files

Use the --files-from or -T option to take input from a list of files.

$ imgdupes -T image_list.txt phash 0

For example, create image_list.txt as below.

101_ObjectCategories/Faces/image_0345.jpg
101_ObjectCategories/Motorbikes/image_0269.jpg
101_ObjectCategories/Motorbikes/image_0735.jpg
101_ObjectCategories/brain/image_0047.jpg
101_ObjectCategories/headphone/image_0034.jpg
101_ObjectCategories/dollar_bill/image_0038.jpg
101_ObjectCategories/ferry/image_0020.jpg
101_ObjectCategories/tick/image_0049.jpg
101_ObjectCategories/Faces_easy/image_0283.jpg
101_ObjectCategories/watch/image_0171.jpg

Find near-duplicate images of an image you specify

Use the --query option to specify a query image file.

$ imgdupes --recursive target_dir --query sample_airplane.png phash 4
Query: sample_airplane.png

target_dir/airplane_0583.jpg
target_dir/airplane_0800.jpg
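
Conceptually, a query compares one hash against every hash in the target directory and keeps the close ones. A minimal sketch under stated assumptions: the hashes are short made-up integers, the threshold is inclusive, and results are ordered nearest first. `find_similar` is a hypothetical name, not an imgdupes function.

```python
from typing import Dict, List


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def find_similar(query_hash: int, hashes: Dict[str, int], threshold: int) -> List[str]:
    """Return paths whose hash is within `threshold` bits of the query, nearest first."""
    scored = sorted((hamming_distance(query_hash, h), path) for path, h in hashes.items())
    return [path for dist, path in scored if dist <= threshold]


# Made-up 8-bit hashes standing in for real perceptual hashes.
hashes = {
    "target_dir/airplane_0583.jpg": 0b11110000,
    "target_dir/airplane_0800.jpg": 0b11110011,
    "target_dir/watch_0122.jpg": 0b00001111,
}
query_hash = 0b11110001

print(find_similar(query_hash, hashes, threshold=4))
```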

Against large dataset

imgdupes supports approximate nearest neighbor search over Hamming distance using NGT or hnsw.

To dedupe images using NGT, install NGT and its Python binding, then run with the --ngt option.

$ imgdupes -rdc --ngt 101_ObjectCategories phash 4

Notice: The --ngt option is enabled by default since version 0.1.0.

For instructions on installing NGT and its Python binding, see NGT and python NGT.

To dedupe images using hnsw, install the hnsw Python binding, then run with the --hnsw option.

$ imgdupes -rdc --hnsw 101_ObjectCategories phash 4

Fast exact searching

imgdupes supports exact nearest neighbor search over Hamming distance using faiss (IndexFlatL2).

To dedupe images using faiss, install the faiss Python binding, then run with the --faiss-flat option.

$ imgdupes -rdc --faiss-flat 101_ObjectCategories phash 4
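
An L2 index can stand in for Hamming distance because, for vectors that contain only 0s and 1s, the squared L2 distance equals the number of differing positions (faiss's IndexFlatL2 reports squared L2 distances). The sketch below checks that identity in plain Python. It illustrates the general property only and makes no claim about how imgdupes encodes hashes for faiss; the helper names are made up.

```python
from typing import List


def to_bits(value: int, width: int) -> List[int]:
    """Expand an integer hash into a list of 0/1 values, most significant bit first."""
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]


def squared_l2(a: List[int], b: List[int]) -> int:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


v1 = to_bits(0b10110100, 8)
v2 = to_bits(0b10011100, 8)

# Each differing bit contributes (1 - 0)^2 = 1, so both values agree.
print(squared_l2(v1, v2), hamming(v1, v2))  # 2 2
```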

Using imgdupes with Docker, without installing it

You can use imgdupes without installing it by running a pre-built Docker container image.
NGT, hnsw and faiss are already installed in this image.

Place the target directory in the current directory and execute the following command.

$ docker run -it -v $PWD:/app knjcode/imgdupes -rdc target_dir phash 0

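When run this way, the current directory is mounted at /app inside the container and referenced by imgdupes.

By aliasing the command, you can use imgdupes as if it were installed. The single quotes delay the expansion of $PWD until the alias is run, so the directory you are in at that time is the one mounted.

$ alias imgdupes='docker run -it -v "$PWD":/app knjcode/imgdupes'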
$ imgdupes -rdc target_dir phash 0

To upgrade the imgdupes Docker image, pull it as below.

$ docker pull knjcode/imgdupes

Available hash algorithms

imgdupes uses the ImageHash library to calculate perceptual hashes (except for the phash_org algorithm).

  • ahash: average hashing

  • phash: perception hashing (using only the 8x8 DCT low-frequency values including the first term)

  • dhash: difference hashing

  • whash: wavelet hashing

  • phash_org: perception hashing (a corrected variant of the ImageHash implementation)

    uses only the 8x8 DCT low-frequency values and excludes the first term, since the DC coefficient can differ significantly from the other values and would throw off the average.
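
To give a feel for the simpler algorithms in this list, here is a minimal sketch of average hashing and difference hashing in plain Python. It assumes the image has already been converted to grayscale and shrunk to a tiny grid (a real 64-bit hash would use an 8x8 grid for ahash and a 9x8 grid for dhash). It is not the ImageHash implementation, and the function names are made up.

```python
from typing import List


def average_hash(pixels: List[List[int]]) -> List[int]:
    """ahash: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def difference_hash(pixels: List[List[int]]) -> List[int]:
    """dhash: one bit per horizontal neighbor pair, set when the right pixel is brighter."""
    return [
        1 if right > left else 0
        for row in pixels
        for left, right in zip(row, row[1:])
    ]


# A tiny 3x2 grayscale grid; its mean brightness is 50.
grid = [
    [10, 20, 30],
    [90, 80, 70],
]
print(average_hash(grid))     # [0, 0, 0, 1, 1, 1]
print(difference_hash(grid))  # [1, 1, 0, 0]
```

Small brightness or compression changes flip few of these bits, which is why a small Hamming distance indicates a near-duplicate.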

Options

-r --recursive

search images recursively from the target directory (default=False)

-d --delete

prompt user for files to preserve and delete (default=False)

-c --imgcat

display duplicate images for iTerm2 (default=False)

-m --summarize

summarize dupe information

-N --noprompt

together with --delete, preserve the first file in each set of duplicates and delete the rest without prompting the user

--query <image filename>

find image files that are duplicated or similar to the specified image file from the target directory

--hash-bits 64

bits of perceptual hash (default=64)

The number of bits must be a perfect square (n^2).
For example, you can specify 64 (8^2), 144 (12^2), 256 (16^2), etc.
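
A quick way to check whether a value is acceptable is to take its integer square root and square it again. The sketch below is a hypothetical validator (`hash_size_from_bits` is not an imgdupes function) and requires Python 3.8 or later for `math.isqrt`.

```python
import math


def hash_size_from_bits(hash_bits: int) -> int:
    """Return n such that hash_bits == n * n, or raise if there is no such n."""
    if hash_bits <= 0:
        raise ValueError(f"hash bits must be positive, got {hash_bits}")
    n = math.isqrt(hash_bits)
    if n * n != hash_bits:
        raise ValueError(f"hash bits must be a perfect square, got {hash_bits}")
    return n


for bits in (64, 144, 256):
    print(bits, "->", hash_size_from_bits(bits))  # 8, 12, 16

try:
    hash_size_from_bits(128)
except ValueError as err:
    print(err)
```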

--sort <sort_type>

how to sort duplicate image files (default=filesize)

You can specify the following types:

  • filesize: sort by filesize in descending order
  • filepath: sort by filepath in ascending order
  • imagesize: sort by pixel width and height in descending order
  • width: sort by pixel width in descending order
  • height: sort by pixel height in descending order
  • none: do not sort

--reverse

reverse sort order
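
The sort types and --reverse can be modeled as a key function plus a default direction. This is a sketch, not the imgdupes code: the records are made up, `imagesize` is assumed to compare (width, height) pairs, and `none` is assumed to ignore --reverse.

```python
# Made-up image records for illustration.
images = [
    {"path": "a.jpg", "filesize": 300, "width": 640, "height": 480},
    {"path": "b.jpg", "filesize": 100, "width": 800, "height": 600},
    {"path": "c.jpg", "filesize": 200, "width": 320, "height": 240},
]

# sort type -> (key function, whether the default order is descending)
SORT_KEYS = {
    "filesize": (lambda img: img["filesize"], True),
    "filepath": (lambda img: img["path"], False),
    "imagesize": (lambda img: (img["width"], img["height"]), True),  # assumption
    "width": (lambda img: img["width"], True),
    "height": (lambda img: img["height"], True),
}


def sort_images(images, sort_type="filesize", reverse=False):
    """Sort records by the given type; `reverse` flips the type's default order."""
    if sort_type == "none":
        return list(images)  # assumption: order is left untouched
    key, descending = SORT_KEYS[sort_type]
    return sorted(images, key=key, reverse=(descending != reverse))


print([img["path"] for img in sort_images(images)])                # ['a.jpg', 'c.jpg', 'b.jpg']
print([img["path"] for img in sort_images(images, reverse=True)])  # ['b.jpg', 'c.jpg', 'a.jpg']
print([img["path"] for img in sort_images(images, "filepath")])    # ['a.jpg', 'b.jpg', 'c.jpg']
```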

--num-proc 4

number of processes used for hash calculation and NGT (default=cpu_count-1)

--log

output logs of duplicate and delete files (default=False)

--no-cache

do not create or use the image hash cache (default=False)

--no-subdir-warning

suppress the warnings that appear when similar images are in different subdirectories

--sameline

list each set of matches on a single line

--dry-run

dry run (do not delete any files)

--faiss-flat

use faiss exact search (IndexFlatL2) for calculating the Hamming distance between image hashes (default=False)

--faiss-flat-k 20

number of searched objects when using faiss-flat (default=20)

imgcat options (use with -c, --imgcat)

--size 256x256

resize image (default=256x256)

--space 0

space between images (default=0)

--space-color black

space color between images (default=black)

--tile-num 4

horizontal tile number (default=4)

--interpolation INTER_LINEAR

interpolation methods (default=INTER_LINEAR)

You can specify OpenCV interpolation methods: INTER_NEAREST, INTER_LINEAR, INTER_AREA, INTER_CUBIC, INTER_LANCZOS4, etc.

--no-keep-aspect

do not keep the aspect ratio when displaying images

ngt options

--ngt

use NGT for calculating the Hamming distance between image hashes (default=True)

--ngt-k 20

number of searched objects when using NGT. Increasing this value improves accuracy but increases computation time. (default=20)
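
One way to read "number of searched objects" is as the maximum number of neighbors returned per image. Under that assumption, a k smaller than the size of a duplicate set can miss some members. The sketch below shows the effect with a brute-force search over made-up 4-bit hashes; it does not use NGT, and `k_nearest_within` is a hypothetical helper.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two integer hashes differ."""
    return bin(a ^ b).count("1")


def k_nearest_within(query: int, candidates: List[int], k: int, threshold: int) -> List[int]:
    """Take the k nearest candidates first, then keep those within the threshold."""
    nearest = sorted(candidates, key=lambda h: hamming(query, h))[:k]
    return [h for h in nearest if hamming(query, h) <= threshold]


query = 0b0000
candidates = [0b0001, 0b0010, 0b0100, 0b1111]  # distances 1, 1, 1, 4

print(len(k_nearest_within(query, candidates, k=2, threshold=1)))  # 2: one match is cut off
print(len(k_nearest_within(query, candidates, k=4, threshold=1)))  # 3: all matches found
```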

--ngt-epsilon 0.1

search range when using NGT. Increasing this value improves accuracy but increases computation time. (default=0.1)

--ngt-edges 10

number of initial edges of each node at graph generation time. (default=10)

--ngt-edges-for-search 40

number of edges at search time. (default=40)

hnsw options

--hnsw

use hnsw for calculating the Hamming distance between image hashes (default=False)

--hnsw-k 20

number of searched objects when using hnsw. Increasing this value improves accuracy but increases computation time. (default=20)

--hnsw-ef-construction 100

controls index search speed/build speed tradeoff (default=100)

--hnsw-m 16

m is tightly connected with the internal dimensionality of the data and strongly affects memory consumption (default=16)

--hnsw-ef 50

controls recall. A higher ef leads to better accuracy but slower search (default=50)

faiss options

--faiss-cuda

use a CUDA-enabled device for faster searching (requires faiss-gpu, an Nvidia GPU, and the CUDA toolkit)
Install: https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
General: https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU

CUDA options

--cuda-device

use the specified CUDA device for CUDA-accelerated searches (default=device with lowest load)
NOTE: if the specified device is not found on the system, the CUDA device with the lowest load is used

License

MIT