github.com/idealo/imagededup @v03.3 sqlite

359 symbols 1,343 edges 37 files 95 documented · 26%
README

Image Deduplicator (imagededup)

imagededup is a python package that simplifies the task of finding exact and near duplicates in an image collection.

The package provides hashing algorithms, which are particularly good at finding exact duplicates, and convolutional neural networks, which are also adept at finding near duplicates. It also includes an evaluation framework for judging the quality of deduplication on a given dataset.
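To make the hashing idea concrete, here is a minimal pure-Python sketch of difference hashing on a tiny grayscale grid, followed by a Hamming-distance comparison. It is an illustration only, not imagededup's implementation: the function names `dhash_bits` and `hamming_distance` are hypothetical, and a real hasher would first resize and grayscale the image.

```python
from typing import List


def dhash_bits(pixels: List[List[int]]) -> str:
    """Return one bit per horizontally adjacent pixel pair (1 if left > right)."""
    bits = []
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits.append('1' if left > right else '0')
    return ''.join(bits)


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError('hashes must have the same length')
    return sum(x != y for x, y in zip(a, b))


# Hypothetical 3x4 grayscale "image" (values are illustrative only).
original = [
    [10, 20, 30, 40],
    [40, 30, 20, 10],
    [10, 50, 10, 50],
]
brighter = [[p + 5 for p in row] for row in original]  # uniform brightness shift
mirrored = [row[::-1] for row in original]             # horizontally flipped

h_original = dhash_bits(original)
d_brighter = hamming_distance(h_original, dhash_bits(brighter))
d_mirrored = hamming_distance(h_original, dhash_bits(mirrored))
print(h_original, d_brighter, d_mirrored)
```

Because the hash only records whether each pixel is brighter than its neighbour, a uniform brightness change leaves it untouched, while a structurally different image yields a large distance. A distance threshold then decides what counts as a duplicate.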

The package provides the following functionality:

  • Finding duplicates in a directory using one of the following algorithms:
      • Convolutional Neural Network (CNN) - Select from several prepackaged models or provide your own custom model.
      • Perceptual hashing (PHash)
      • Difference hashing (DHash)
      • Wavelet hashing (WHash)
      • Average hashing (AHash)
  • Generation of encodings for images using one of the above stated algorithms.
  • Framework to evaluate effectiveness of deduplication given a ground truth mapping.
  • Plotting duplicates found for a given image file.
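As a rough illustration of what evaluating against a ground truth mapping involves, the sketch below computes precision and recall over duplicate pairs. It assumes both mappings are dictionaries from a filename to a list of duplicate filenames. The function `precision_recall` is hypothetical and is not the package's own `evaluate` function.

```python
from typing import Dict, List, Tuple


def precision_recall(ground_truth: Dict[str, List[str]],
                     retrieved: Dict[str, List[str]]) -> Tuple[float, float]:
    """Pool true/false positives and false negatives over all query files."""
    tp = fp = fn = 0
    for name, true_dups in ground_truth.items():
        truth = set(true_dups)
        found = set(retrieved.get(name, []))
        tp += len(truth & found)   # correctly retrieved duplicates
        fp += len(found - truth)   # retrieved but not real duplicates
        fn += len(truth - found)   # real duplicates that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical data: a, b and c are mutual duplicates; d is unique.
ground_truth = {
    'a.jpg': ['b.jpg', 'c.jpg'],
    'b.jpg': ['a.jpg', 'c.jpg'],
    'c.jpg': ['a.jpg', 'b.jpg'],
    'd.jpg': [],
}
retrieved = {
    'a.jpg': ['b.jpg'],
    'b.jpg': ['a.jpg'],
    'c.jpg': ['d.jpg'],
    'd.jpg': ['c.jpg'],
}

precision, recall = precision_recall(ground_truth, retrieved)
print(f'precision={precision:.2f} recall={recall:.2f}')
```

Pooling counts across all files is only one possible averaging choice; per-file averages would weight small and large duplicate groups differently.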

Detailed documentation for the package can be found at: https://idealo.github.io/imagededup/

imagededup is compatible with Python 3.8+ and runs on Linux, macOS and Windows. It is distributed under the Apache 2.0 license.

⚙️ Installation

There are two ways to install imagededup:

  • Install imagededup from PyPI (recommended):
pip install imagededup
  • Install imagededup from the GitHub source:
git clone https://github.com/idealo/imagededup.git
cd imagededup
pip install "cython>=0.29"
python setup.py install

🚀 Quick Start

To find duplicates in an image directory using perceptual hashing, use the following workflow:

  • Import perceptual hashing method
from imagededup.methods import PHash
phasher = PHash()
  • Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')
  • Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)
  • Plot the duplicates obtained for a given file (e.g. 'ukbench00120.jpg') using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')

The output is a plot of the duplicates found for 'ukbench00120.jpg' (figure not included here).

The complete code for the workflow is:

from imagededup.methods import PHash
phasher = PHash()

# Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')

# Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)

# Plot duplicates obtained for a given file using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
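Once a duplicates dictionary is available, a common next step is deciding which files to delete. The sketch below assumes the dictionary maps each filename to a list of its duplicate filenames and that the mapping is symmetric (if a lists b, then b lists a). The helper `files_to_remove` is hypothetical and is not necessarily the logic behind the package's `find_duplicates_to_remove`.

```python
from typing import Dict, List


def files_to_remove(duplicate_map: Dict[str, List[str]]) -> List[str]:
    """Keep the first file of each duplicate group (in sorted order), drop the rest."""
    remove = set()
    for name in sorted(duplicate_map):
        if name in remove:
            # Already scheduled for removal as a duplicate of a kept file.
            continue
        remove.update(d for d in duplicate_map[name] if d != name)
    return sorted(remove)


# Hypothetical duplicates dictionary: a, b and c are mutual duplicates.
duplicates = {
    'a.jpg': ['b.jpg', 'c.jpg'],
    'b.jpg': ['a.jpg', 'c.jpg'],
    'c.jpg': ['a.jpg', 'b.jpg'],
    'd.jpg': [],
}

to_remove = files_to_remove(duplicates)
print(to_remove)
```

Keeping the alphabetically first file is an arbitrary choice made for the example; in practice one might prefer the largest or highest-resolution file in each group.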

It is also possible to use your own custom models for finding duplicates using the CNN method.

For examples, refer to this part of the repository.

For more detailed usage of the package functionality, refer to: https://idealo.github.io/imagededup/

⏳ Benchmarks

Update: The provided benchmarks are only valid up to imagededup v0.2.2. Later releases made significant changes to all methods, so the current benchmarks may not hold.

Detailed benchmarks on speed and classification metrics for the different methods are provided in the documentation. Generally speaking, the following conclusions can be drawn:

  • CNN works best for near duplicates and datasets containing transformations.
  • All deduplication methods fare well on datasets containing exact duplicates, but Difference hashing is the fastest.
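To illustrate why learned feature vectors tolerate transformations better than bit-exact hashes, the sketch below compares small made-up vectors with cosine similarity and a threshold. The use of cosine similarity, the `min_similarity` parameter, and the vector values are all assumptions for illustration; real CNN encodings have far more dimensions.

```python
import math
from typing import Dict, List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def near_duplicates(encodings: Dict[str, List[float]],
                    min_similarity: float = 0.9) -> Dict[str, List[str]]:
    """Pair up every two files whose vectors are at least min_similarity alike."""
    names = sorted(encodings)
    result: Dict[str, List[str]] = {name: [] for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_similarity(encodings[a], encodings[b]) >= min_similarity:
                result[a].append(b)
                result[b].append(a)
    return result


# Made-up 3-dimensional "encodings"; the crop differs slightly from the original.
encodings = {
    'cat.jpg': [1.0, 0.0, 1.0],
    'cat_crop.jpg': [0.9, 0.1, 1.0],
    'dog.jpg': [0.0, 1.0, 0.0],
}

matches = near_duplicates(encodings)
print(matches)
```

The all-pairs loop is quadratic in the number of images, which is fine for a sketch but is one reason speed benchmarks matter when choosing a method for large collections.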

🤝 Contribute

We welcome all kinds of contributions. See the Contribution guide for more details.

📝 Citation

Please cite Imagededup in your publications if this is useful for your research. Here is an example BibTeX entry:

@misc{idealods2019imagededup,
  title={Imagededup},
  author={Tanuj Jain and Christopher Lennan and Zubin John and Dat Tran},
  year={2019},
  howpublished={\url{https://github.com/idealo/imagededup}},
}

🏗 Maintainers

© Copyright

See LICENSE for details.

Core symbols most depended-on inside this repo

  • encode_images · called by 29 · imagededup/methods/cnn.py
  • find_duplicates · called by 28 · imagededup/methods/cnn.py
  • encode_image · called by 20 · imagededup/methods/cnn.py
  • retrieve_results · called by 14 · imagededup/handlers/search/retrieval.py
  • find_duplicates_to_remove · called by 13 · imagededup/methods/cnn.py
  • load_image · called by 10 · imagededup/utils/image_utils.py
  • _find_duplicates_dict · called by 9 · imagededup/methods/cnn.py
  • evaluate · called by 9 · imagededup/evaluation/evaluation.py

Shape

Function 271
Method 69
Class 19

Languages

Python 100%

Modules by API surface

  • tests/test_hashing.py · 71 symbols
  • tests/test_cnn.py · 55 symbols
  • imagededup/methods/hashing.py · 25 symbols
  • tests/test_image_utils.py · 23 symbols
  • tests/test_information_retrieval.py · 15 symbols
  • tests/test_evaluator.py · 15 symbols
  • tests/test_bktree.py · 14 symbols
  • tests/test_plotter.py · 13 symbols
  • imagededup/methods/cnn.py · 13 symbols
  • tests/test_retrieval.py · 12 symbols
  • mkdocs/autogen.py · 11 symbols
  • imagededup/utils/models.py · 11 symbols

Dependencies from manifests, versioned

Pillow 9.0 · 1×
cython 0.29 · 1×

For agents

$ claude mcp add imagededup \
  -- python -m otcore.mcp_server <graph>
