Need a good algorithm to categorize 8GB of pictures

Tags: c++, image, jpeg, cimg

I have about 150,000 pictures, some of which are duplicates. SSIM looks like a good choice for comparing two pictures to decide whether they are duplicates. However, finding duplicates that way means comparing every pair of pictures: 150,000 * 149,999 / 2, roughly 11 billion comparisons, which would take forever.

So what I am looking for now is a fast and effective algorithm that computes an average value for each picture, so that I only need to compare pictures whose average values are close to each other.

In short: I am looking for an effective way to categorize pictures!

I plan to use the C++ CImg library for this task because it's fast.
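A rough sketch of the "average value" idea, not a finished solution. It assumes each picture has already been decoded into an 8-bit grayscale buffer (for example with CImg); the `GrayImage` struct, the function names, and the `tolerance` parameter are hypothetical and chosen only for illustration. Sorting by mean intensity means only neighbours within the tolerance are paired, instead of every picture with every other one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical container: one decoded grayscale image, 8 bits per pixel.
struct GrayImage {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint8_t> pixels;  // row-major, width * height entries
};

// Mean intensity in [0, 255]: the "average value" of a picture.
double meanIntensity(const GrayImage& img) {
    assert(!img.pixels.empty());
    assert(img.pixels.size() == img.width * img.height);
    const std::uint64_t sum = std::accumulate(
        img.pixels.begin(), img.pixels.end(), std::uint64_t{0});
    return static_cast<double>(sum) / static_cast<double>(img.pixels.size());
}

// Returns index pairs whose mean intensities differ by at most `tolerance`.
// Only these pairs would be handed to the expensive comparison (e.g. SSIM).
std::vector<std::pair<std::size_t, std::size_t>> candidatePairs(
    const std::vector<GrayImage>& images, double tolerance) {
    std::vector<std::pair<double, std::size_t>> keyed;  // (mean, image index)
    keyed.reserve(images.size());
    for (std::size_t i = 0; i < images.size(); ++i) {
        keyed.emplace_back(meanIntensity(images[i]), i);
    }
    std::sort(keyed.begin(), keyed.end());

    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t a = 0; a < keyed.size(); ++a) {
        // Walk forward only while the means are still within tolerance.
        for (std::size_t b = a + 1;
             b < keyed.size() && keyed[b].first - keyed[a].first <= tolerance;
             ++b) {
            pairs.emplace_back(keyed[a].second, keyed[b].second);
        }
    }
    return pairs;
}
```

A single mean is a very coarse key: unrelated pictures can share it, so it only narrows the candidate set and the real comparison still has to confirm each pair.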

Thanks!

moccajoghurt asked Dec 17 '12 19:12


2 Answers

There are pictures that vary in height but are basically the same picture, with an unrelated box at the bottom that changes the height.

If the top of the picture is always the same for two duplicates, you could compute a hash value from N lines of pixels that are known to be safe (i.e. lines that the box at the bottom never reaches).

Once you have hashed all your files, you can sort the hash values and compare more precisely only pictures with the same hash value.
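The hash-then-sort step could be sketched as follows. This is only an illustration under assumptions: the pictures are already decoded into 8-bit grayscale buffers, the `GrayImage` struct and both function names are hypothetical, and FNV-1a is just one convenient hash, not something the approach prescribes. An exact hash only matches when the hashed rows are pixel-identical, so pictures that were re-encoded with different JPEG settings would need a more tolerant fingerprint.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical container: one decoded grayscale image, 8 bits per pixel.
struct GrayImage {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint8_t> pixels;  // row-major, width * height entries
};

// 64-bit FNV-1a hash over the first `rows` rows of the image.
std::uint64_t hashTopRows(const GrayImage& img, std::size_t rows) {
    assert(img.pixels.size() == img.width * img.height);
    const std::size_t count = std::min(rows, img.height) * img.width;
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (std::size_t i = 0; i < count; ++i) {
        h ^= img.pixels[i];
        h *= 1099511628211ULL;  // FNV prime
    }
    return h;
}

// Sorts images by hash and returns groups of indices sharing a hash.
// Images with a unique hash are dropped: they have no duplicate candidate.
std::vector<std::vector<std::size_t>> groupByTopRows(
    const std::vector<GrayImage>& images, std::size_t rows) {
    std::vector<std::pair<std::uint64_t, std::size_t>> keyed;  // (hash, index)
    keyed.reserve(images.size());
    for (std::size_t i = 0; i < images.size(); ++i) {
        keyed.emplace_back(hashTopRows(images[i], rows), i);
    }
    std::sort(keyed.begin(), keyed.end());

    std::vector<std::vector<std::size_t>> groups;
    std::size_t start = 0;
    while (start < keyed.size()) {
        std::size_t end = start + 1;
        while (end < keyed.size() && keyed[end].first == keyed[start].first) {
            ++end;
        }
        if (end - start > 1) {
            std::vector<std::size_t> group;
            for (std::size_t k = start; k < end; ++k) {
                group.push_back(keyed[k].second);
            }
            groups.push_back(group);
        }
        start = end;
    }
    return groups;
}
```

Each returned group is then small enough to compare precisely, pair by pair. Hashing costs one pass over the data and sorting is O(n log n), which replaces the full pairwise scan.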

Vincent Mimoun-Prat answered Sep 21 '22 05:09


I'd try a hash/fingerprint like approach:

  • Generate a fingerprint for each image and store it, together with relevant image attributes such as size and number of components, in a metafile or a database. The fingerprint could be computed from the common subimage; it may be a lossy compressed spectrogram, a simple vector containing the frequency bins of an FFT, a histogram, or another technique (I don't know which fits best; this is most probably very content dependent).

  • As littlestewie mentioned, grouping beforehand with image attributes such as size and number of color components will greatly reduce the number of (binary) comparisons, which would be (n*(n-1))/2 for each group.

  • Compare the fingerprints with an appropriate tolerance for further sub-grouping (take care to cover the cases where one image has matches in multiple groups).

  • OpenCV could do the final match:

    How to detect the Sun from the space sky in OpenCv?
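The grouping and fingerprint steps above could look roughly like this. It is a minimal sketch under assumptions, not a recommended implementation: the fingerprint is a 16-bin normalized intensity histogram (one of the options listed, picked only because it is short), the `Image` struct, the function names, `kBins`, and `tolerance` are all hypothetical, and the pictures are assumed to be decoded into 8-bit buffers already.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical container: decoded image with interleaved 8-bit samples.
struct Image {
    std::size_t width;
    std::size_t height;
    std::size_t channels;
    std::vector<std::uint8_t> pixels;
};

constexpr std::size_t kBins = 16;  // assumed histogram resolution
using Fingerprint = std::array<double, kBins>;

// Normalized histogram over all samples; the bins sum to 1.
Fingerprint histogramFingerprint(const Image& img) {
    assert(!img.pixels.empty());
    Fingerprint fp{};
    for (std::uint8_t v : img.pixels) {
        fp[v * kBins / 256] += 1.0;
    }
    for (double& bin : fp) {
        bin /= static_cast<double>(img.pixels.size());
    }
    return fp;
}

// L1 distance between two fingerprints, in the range [0, 2].
double fingerprintDistance(const Fingerprint& a, const Fingerprint& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < kBins; ++i) {
        d += std::fabs(a[i] - b[i]);
    }
    return d;
}

// Groups images by (width, height, channels), then keeps only the pairs
// inside each group whose fingerprints are within `tolerance`.
std::vector<std::pair<std::size_t, std::size_t>> candidatePairs(
    const std::vector<Image>& images, double tolerance) {
    using Key = std::tuple<std::size_t, std::size_t, std::size_t>;
    std::map<Key, std::vector<std::size_t>> groups;
    std::vector<Fingerprint> fps;
    fps.reserve(images.size());
    for (std::size_t i = 0; i < images.size(); ++i) {
        fps.push_back(histogramFingerprint(images[i]));
        groups[std::make_tuple(images[i].width, images[i].height,
                               images[i].channels)].push_back(i);
    }

    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (const auto& entry : groups) {
        const std::vector<std::size_t>& members = entry.second;
        // n*(n-1)/2 fingerprint comparisons per group of n images.
        for (std::size_t a = 0; a < members.size(); ++a) {
            for (std::size_t b = a + 1; b < members.size(); ++b) {
                if (fingerprintDistance(fps[members[a]], fps[members[b]])
                        <= tolerance) {
                    pairs.emplace_back(members[a], members[b]);
                }
            }
        }
    }
    return pairs;
}
```

The surviving pairs are the ones worth passing to a precise final match. Grouping by exact size would miss duplicates that differ in height, so in that case the key would have to be loosened (for example to width and channel count only) or the fingerprint computed from the common subimage.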

Related questions regarding image comparison using OpenCV:

  • OpenCV / SURF How to generate a image hash / fingerprint / signature out of the descriptors?

  • OpenCV: Fingerprint Image and Compare Against Database

Sam answered Sep 24 '22 05:09