 

Removing Duplicate Images [closed]

We have a collection of photo images totalling a few hundred gigabytes. A large number of the photos are visual duplicates, but with differing file sizes, resolutions, compression, etc.

Is it possible to use any specific image processing methods to search out and remove these duplicate images?

asked Oct 22 '08 by Karl Glennon


3 Answers

I recently wanted to accomplish this task for a PHP image gallery. I wanted to be able to generate a "fuzzy" fingerprint for an uploaded image, check a database for any images that had the same fingerprint (indicating they were similar), and then compare them more closely to determine how similar they were.

I accomplished it by resizing the uploaded image to 150 pixels wide, reducing it to greyscale, rounding the value of each colour to the nearest multiple of 16 (giving 17 possible shades of grey between 0 and 255), normalising the values and storing them in an array, thereby creating a "fuzzy" colour histogram. I then created an md5sum of the histogram, which I could search for in my database. This was extremely effective at narrowing down images which were visually very similar to the uploaded file.
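
A minimal sketch of that fingerprinting step using PHP's GD extension might look like the following (the function name, bucket rounding and normalisation precision are my own assumptions, not the author's original code, which is at the link below):

    <?php
    // Sketch of the "fuzzy" histogram fingerprint described above (illustrative only).
    function fuzzy_fingerprint(string $path): string
    {
        // Resize to 150 pixels wide (height scaled proportionally) and greyscale it.
        $img = imagescale(imagecreatefromstring(file_get_contents($path)), 150);
        imagepalettetotruecolor($img);            // ensure imagecolorat() returns packed RGB
        imagefilter($img, IMG_FILTER_GRAYSCALE);  // reduce to greyscale

        // 17 buckets: each grey value rounded to the nearest multiple of 16.
        $histogram = array_fill(0, 17, 0);
        $w = imagesx($img);
        $h = imagesy($img);
        for ($y = 0; $y < $h; $y++) {
            for ($x = 0; $x < $w; $x++) {
                $grey = imagecolorat($img, $x, $y) & 0xFF;  // R == G == B after greyscaling
                $histogram[(int) round($grey / 16)]++;
            }
        }

        // Normalise by pixel count so image size doesn't matter, then hash the result.
        $total = $w * $h;
        $normalised = array_map(fn ($n) => round($n / $total, 3), $histogram);
        return md5(implode(',', $normalised));
    }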

Then, to compare the uploaded file against each "similar" image in the database, I took both images, resized them to 16x16, and analysed them pixel by pixel, subtracting the RGB value of each pixel from the value of the corresponding pixel in the other image, adding all the values together and dividing by the number of pixels to give me an average colour deviation. Anything less than a specific value was determined to be a duplicate.
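
A sketch of that comparison step, again GD-based and illustrative (the threshold of 30 below is an arbitrary placeholder, not the author's value):

    <?php
    // Average per-pixel colour deviation between two images, both resized to 16x16.
    function average_colour_deviation(string $pathA, string $pathB): float
    {
        $imgs = [];
        foreach ([$pathA, $pathB] as $path) {
            $img = imagescale(imagecreatefromstring(file_get_contents($path)), 16, 16);
            imagepalettetotruecolor($img);
            $imgs[] = $img;
        }

        $sum = 0;
        for ($y = 0; $y < 16; $y++) {
            for ($x = 0; $x < 16; $x++) {
                $a = imagecolorat($imgs[0], $x, $y);
                $b = imagecolorat($imgs[1], $x, $y);
                // Accumulate the absolute difference of each colour channel.
                $sum += abs((($a >> 16) & 0xFF) - (($b >> 16) & 0xFF))   // red
                      + abs((($a >> 8) & 0xFF) - (($b >> 8) & 0xFF))     // green
                      + abs(($a & 0xFF) - ($b & 0xFF));                  // blue
            }
        }
        return $sum / (16 * 16);
    }

    // Treat anything under an (assumed) deviation threshold as a duplicate.
    $isDuplicate = average_colour_deviation('a.jpg', 'b.jpg') < 30;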

The whole thing is written in PHP using the GD module, and a comparison against thousands of images takes only a few hundred milliseconds per uploaded file.

My code and methodology are here: http://www.catpa.ws/php-duplicate-image-finder/

answered by Pawz Lion

Try PerceptualDiff for comparing two images with the same dimensions. It allows thresholds, such as treating images where only X pixels differ as visually indistinguishable.

If visual duplicates may have different dimensions due to scaling, or different file types, you may want to convert them to a standard format for comparison. For example, I might use ImageMagick to scale all images to 100x100 and save them as PNG files.
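
A sketch of that normalisation using ImageMagick through PHP's Imagick extension (assuming the extension is installed; the plain `convert` command-line tool would do the same job, and the paths are placeholders):

    <?php
    // Normalise every image to a 100x100 PNG so later comparisons see a
    // consistent size and file type.
    function normalise_for_comparison(string $inPath, string $outPath): void
    {
        $img = new Imagick($inPath);
        $img->thumbnailImage(100, 100);   // force a fixed 100x100 size
        $img->setImageFormat('png');      // and a single output format
        $img->writeImage($outPath);
        $img->clear();
    }

    normalise_for_comparison('holiday.jpg', 'normalised/holiday.png');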

answered by Liam


A very simple approach is the following:

  • Convert the image to greyscale in memory, so every pixel is only a number between 0 (black) and 255 (white).

  • Scale the image to a fixed size. Finding the right size is important; you should play around with different sizes. For example, you could scale each image to 64x64 pixels, but you may get better or worse results with smaller or bigger pictures.

  • Once you've done this for all images (yes, that will take a while), always load two images into memory and subtract them from each other. That is, subtract the value of pixel (0,0) in image A from the value of pixel (0,0) in image B, then do the same for (0,1) in both, and so on. The resulting value may be positive or negative; you should always store the absolute value (so 5 results in 5, while -8 results in 8).

  • Now you have a third image, the "difference image" (delta image) of image A and B. If the two images were identical, the delta image is all black (all values subtract to zero). The "less black" it is, the less identical the images are. You need to find a good threshold: even if two images are in fact identical to your eyes, scaling, differing brightness and so on mean the delta image will not be totally black, it will just have only very dark greytones. So you need a threshold that says "if the average error (the delta image's brightness) is below a certain value, there is a good chance the images are identical; if it is above that value, they are most likely not." Finding the right threshold is as hard as finding the right scaling size. You will always have false positives (images deemed identical though they are not) and false negatives (images deemed different although they are identical). A rough sketch of the whole procedure follows this list.
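
Here is that sketch, using PHP's GD extension; the 64x64 size and the threshold of 25 are arbitrary starting points you would need to tune, as described above:

    <?php
    // Greyscale, fixed-size thumbnail of an image.
    function grey_thumb(string $path, int $size = 64): GdImage
    {
        $img = imagescale(imagecreatefromstring(file_get_contents($path)), $size, $size);
        imagepalettetotruecolor($img);
        imagefilter($img, IMG_FILTER_GRAYSCALE);
        return $img;
    }

    // Average brightness of the "delta image": mean absolute per-pixel difference.
    function mean_delta(GdImage $a, GdImage $b, int $size = 64): float
    {
        $sum = 0;
        for ($y = 0; $y < $size; $y++) {
            for ($x = 0; $x < $size; $x++) {
                // After greyscaling R == G == B, so one channel is enough.
                $sum += abs((imagecolorat($a, $x, $y) & 0xFF)
                          - (imagecolorat($b, $x, $y) & 0xFF));
            }
        }
        return $sum / ($size * $size);
    }

    $delta = mean_delta(grey_thumb('a.jpg'), grey_thumb('b.jpg'));
    $probablyDuplicates = $delta < 25;   // threshold needs tuning, as described above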

This algorithm is ultra slow. Just creating the greyscale images takes tons of time, and then you need to compare each greyscale image against every other one, which again takes tons of time. Storing all the greyscale images also takes a lot of disk space. So this algorithm is very inefficient, but the results are not that bad considering how simple it is; they are better than I had initially expected.

The only way to get even better results is to use advanced image processing, and this is where it starts getting really complicated. It involves a lot of math (a real lot of it). There are good applications (dupe finders) for many systems that have this implemented, so unless you must program it yourself, you are probably better off using one of those solutions. I have read a lot of papers on this topic, but I'm afraid most of it goes beyond my horizon. Even the algorithms I might be able to implement according to these papers are beyond it; that means I understand what needs to be done, but I have no idea why or how it actually works, it's just magic ;-)

answered by Mecki