I have about 150,000 pictures, and some of them are duplicates. I have figured out that the SSIM algorithm is a good choice for comparing two pictures to see whether they are duplicates. However, to find duplicates that way I would have to compare every picture with every other one, about 150,000 * 149,999 comparisons, which would take forever.
So what I am looking for now is a fast and effective algorithm that computes an average value for each picture, so that I only need to compare images whose average values are close to each other.
In short: I am looking for an effective way to categorize pictures!
I plan to use the C++ CImg library for this task because it's fast.
Thanks!
Some pictures vary in height but are basically the same picture; they just have an unrelated box at the bottom that changes the height.
If the top of the picture is always the same for two duplicates, you could compute a hash value from N lines of pixels that are supposed to be pretty safe (i.e. the box at the bottom won't be in those lines).
Once you have hashed all your files, you can sort the hash values and run the more precise comparison only on pictures with the same hash value.
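A minimal sketch of that idea, under some assumptions that are not in the question: the pixel data is already decoded into an 8-bit, row-major buffer (the `Image` struct below is hypothetical; with CImg you would fill it from the loaded image), the hash is 64-bit FNV-1a, and `hashTopRows` / `candidateGroups` are made-up names. An exact hash like this only matches when the top rows are bit-identical, so it will miss duplicates that were recompressed or resized.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical decoded image: 8-bit samples, row-major, interleaved channels.
struct Image {
    int width;
    int height;
    int channels;
    std::vector<std::uint8_t> pixels;  // width * height * channels bytes
};

// 64-bit FNV-1a over the width and the first `rows` rows of pixel data.
std::uint64_t hashTopRows(const Image& img, int rows) {
    assert(rows > 0 && rows <= img.height);
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    auto mix = [&h](std::uint8_t byte) {
        h ^= byte;
        h *= 1099511628211ULL;  // FNV prime
    };
    // Mix in the width so that different row layouts hash differently.
    for (int shift = 0; shift < 32; shift += 8) {
        mix(static_cast<std::uint8_t>(
            (static_cast<std::uint32_t>(img.width) >> shift) & 0xFF));
    }
    const std::size_t count =
        static_cast<std::size_t>(rows) * img.width * img.channels;
    assert(count <= img.pixels.size());
    for (std::size_t i = 0; i < count; ++i) mix(img.pixels[i]);
    return h;
}

// Sorts (hash, index) pairs and returns every run of two or more images
// that share a hash. Only these groups need the expensive comparison.
std::vector<std::vector<std::size_t>> candidateGroups(
    const std::vector<Image>& images, int rows) {
    std::vector<std::pair<std::uint64_t, std::size_t>> keyed;
    keyed.reserve(images.size());
    for (std::size_t i = 0; i < images.size(); ++i) {
        keyed.emplace_back(hashTopRows(images[i], rows), i);
    }
    std::sort(keyed.begin(), keyed.end());

    std::vector<std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < keyed.size();) {
        std::size_t j = i;
        while (j < keyed.size() && keyed[j].first == keyed[i].first) ++j;
        if (j - i > 1) {
            std::vector<std::size_t> group;
            for (std::size_t k = i; k < j; ++k) group.push_back(keyed[k].second);
            groups.push_back(std::move(group));
        }
        i = j;
    }
    return groups;
}
```

Hashing is one pass over the files and sorting is O(n log n), so SSIM only has to run inside the (hopefully small) groups that `candidateGroups` returns. `rows` must not exceed the height of the shortest image.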
I'd try a hash/fingerprint like approach:
Generate a fingerprint for each image and store it, together with relevant image attributes such as size and number of color components, in a metafile or a database. The fingerprint could be computed from the common subimage; it might be a lossy compressed spectrogram, a simple vector containing the frequency bins of an FFT, a histogram, or another technique (I don't have a real clue which fits best; this is most probably very content dependent).
As littlestewie mentioned, grouping beforehand by image attributes such as size and number of color components will greatly reduce the number of pairwise comparisons, which would be (n*(n-1))/2 for each group of n images.
Compare the fingerprints with an appropriate tolerance for further sub-grouping (take care to cover the cases where one image has matches in multiple groups).
OpenCV could do the final match:
How to detect the Sun from the space sky in OpenCv?
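The grouping and tolerant fingerprint comparison described in this answer could look roughly like the sketch below. Everything concrete in it is an assumption for illustration: the hypothetical `Image` struct, a 16-bin intensity histogram over the top `rows` rows as the fingerprint, an L1 distance as the tolerance measure, and the function names. Since the question says duplicates can differ in height, the sketch groups by width and channel count only.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical decoded image: 8-bit samples, row-major, interleaved channels.
struct Image {
    int width;
    int height;
    int channels;
    std::vector<std::uint8_t> pixels;
};

constexpr int kBins = 16;
using Fingerprint = std::array<double, kBins>;

// Normalized intensity histogram over the first `rows` rows.
Fingerprint histogramFingerprint(const Image& img, int rows) {
    assert(rows > 0 && rows <= img.height);
    const std::size_t count =
        static_cast<std::size_t>(rows) * img.width * img.channels;
    assert(count > 0 && count <= img.pixels.size());
    Fingerprint fp{};
    for (std::size_t i = 0; i < count; ++i) {
        fp[img.pixels[i] / (256 / kBins)] += 1.0;
    }
    for (double& bin : fp) bin /= static_cast<double>(count);
    return fp;
}

// L1 distance between two histograms: 0 for identical, 2 for disjoint.
double l1Distance(const Fingerprint& a, const Fingerprint& b) {
    double d = 0.0;
    for (int i = 0; i < kBins; ++i) d += std::fabs(a[i] - b[i]);
    return d;
}

// Returns index pairs whose fingerprints are within `tolerance`.
// Only images with the same width and channel count are compared.
std::vector<std::pair<std::size_t, std::size_t>> candidatePairs(
    const std::vector<Image>& images, int rows, double tolerance) {
    std::map<std::pair<int, int>, std::vector<std::size_t>> groups;
    std::vector<Fingerprint> fps;
    fps.reserve(images.size());
    for (std::size_t i = 0; i < images.size(); ++i) {
        groups[std::make_pair(images[i].width, images[i].channels)].push_back(i);
        fps.push_back(histogramFingerprint(images[i], rows));
    }

    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (const auto& entry : groups) {
        const std::vector<std::size_t>& members = entry.second;
        // n*(n-1)/2 cheap comparisons inside this group only.
        for (std::size_t a = 0; a < members.size(); ++a) {
            for (std::size_t b = a + 1; b < members.size(); ++b) {
                if (l1Distance(fps[members[a]], fps[members[b]]) <= tolerance) {
                    pairs.emplace_back(members[a], members[b]);
                }
            }
        }
    }
    return pairs;
}
```

A histogram ignores where pixels are, so the pairs returned here are only candidates; a precise comparison such as SSIM still has to confirm them. Returning pairs rather than disjoint sub-groups is one way to handle an image that matches several others.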
Related questions regarding image comparison using OpenCV:
OpenCV / SURF How to generate a image hash / fingerprint / signature out of the descriptors?
OpenCV: Fingerprint Image and Compare Against Database