 

Removing Duplicate Images [closed]

We have a collection of photo images totalling a few hundred gigabytes. A large number of the photos are visual duplicates, but with differing file sizes, resolutions, compression, etc.

Is it possible to use any specific image processing methods to search out and remove these duplicate images?

asked Oct 22 '08 by Karl Glennon


3 Answers

I recently wanted to accomplish this task for a PHP image gallery. I wanted to be able to generate a "fuzzy" fingerprint for an uploaded image, check a database for any images that had the same fingerprint (indicating they were similar), and then compare them more closely to determine how similar they were.

I accomplished it by resizing the uploaded image to 150 pixels wide, reducing it to greyscale, rounding the value of each colour to the nearest multiple of 16 (giving 17 possible shades of grey between 0 and 255), normalising the values and storing them in an array, thereby creating a "fuzzy" colour histogram. I then created an md5sum of the histogram, which I could search for in my database. This was extremely effective at narrowing down images which were visually very similar to the uploaded file.
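
A minimal sketch of that fingerprinting step using PHP's GD extension might look like the following (the function name, bucket rounding and normalisation precision are my own assumptions, not the author's original code, which is at the link below):

    <?php
    // Sketch of the "fuzzy" histogram fingerprint described above (illustrative only).
    function fuzzy_fingerprint(string $path): string
    {
        // Resize to 150 pixels wide (height scaled proportionally) and greyscale it.
        $img = imagescale(imagecreatefromstring(file_get_contents($path)), 150);
        imagepalettetotruecolor($img);            // ensure imagecolorat() returns packed RGB
        imagefilter($img, IMG_FILTER_GRAYSCALE);  // reduce to greyscale

        // 17 buckets: each grey value rounded to the nearest multiple of 16.
        $histogram = array_fill(0, 17, 0);
        $w = imagesx($img);
        $h = imagesy($img);
        for ($y = 0; $y < $h; $y++) {
            for ($x = 0; $x < $w; $x++) {
                $grey = imagecolorat($img, $x, $y) & 0xFF;  // R == G == B after greyscaling
                $histogram[(int) round($grey / 16)]++;
            }
        }

        // Normalise by pixel count so image size doesn't matter, then hash the result.
        $total = $w * $h;
        $normalised = array_map(fn ($n) => round($n / $total, 3), $histogram);
        return md5(implode(',', $normalised));
    }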

Then, to compare the uploaded file against each "similar" image in the database, I took both images, resized them to 16x16, and analysed them pixel by pixel, subtracting the RGB value of each pixel from the value of the corresponding pixel in the other image, adding all the values together and dividing by the number of pixels to give me an average colour deviation. Anything less than a specific value was determined to be a duplicate.
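
A sketch of that comparison step, again GD-based and illustrative (the threshold of 30 below is an arbitrary placeholder, not the author's value):

    <?php
    // Average per-pixel colour deviation between two images, both resized to 16x16.
    function average_colour_deviation(string $pathA, string $pathB): float
    {
        $imgs = [];
        foreach ([$pathA, $pathB] as $path) {
            $img = imagescale(imagecreatefromstring(file_get_contents($path)), 16, 16);
            imagepalettetotruecolor($img);
            $imgs[] = $img;
        }

        $sum = 0;
        for ($y = 0; $y < 16; $y++) {
            for ($x = 0; $x < 16; $x++) {
                $a = imagecolorat($imgs[0], $x, $y);
                $b = imagecolorat($imgs[1], $x, $y);
                // Accumulate the absolute difference of each colour channel.
                $sum += abs((($a >> 16) & 0xFF) - (($b >> 16) & 0xFF))   // red
                      + abs((($a >> 8) & 0xFF) - (($b >> 8) & 0xFF))     // green
                      + abs(($a & 0xFF) - ($b & 0xFF));                  // blue
            }
        }
        return $sum / (16 * 16);
    }

    // Treat anything under an (assumed) deviation threshold as a duplicate.
    $isDuplicate = average_colour_deviation('a.jpg', 'b.jpg') < 30;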

The whole thing is written in PHP using the GD module, and a comparison against thousands of images takes only a few hundred milliseconds per uploaded file.

My code and methodology are here: http://www.catpa.ws/php-duplicate-image-finder/

answered by Pawz Lion

Try PerceptualDiff for comparing two images with the same dimensions. It allows thresholds, such as treating images where only X pixels differ as visually indistinguishable.

If visual duplicates may have different dimensions due to scaling, or different file types, you may want to convert them to a standard format for comparison. For example, I might use ImageMagick to scale all images to 100x100 and save them as PNG files.
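
A sketch of that normalisation using ImageMagick through PHP's Imagick extension (assuming the extension is installed; the plain `convert` command-line tool would do the same job, and the paths are placeholders):

    <?php
    // Normalise every image to a 100x100 PNG so later comparisons see a
    // consistent size and file type.
    function normalise_for_comparison(string $inPath, string $outPath): void
    {
        $img = new Imagick($inPath);
        $img->thumbnailImage(100, 100);   // force a fixed 100x100 size
        $img->setImageFormat('png');      // and a single output format
        $img->writeImage($outPath);
        $img->clear();
    }

    normalise_for_comparison('holiday.jpg', 'normalised/holiday.png');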

answered by Liam


A very simple approach is the following:

  • Convert the image to greyscale in memory, so every pixel is only a number between 0 (black) and 255 (white).

  • Scale the image to a fixed size. Finding the right size is important; you should play around with different sizes. For example, you could scale each image to 64x64 pixels, but you may get better or worse results with smaller or bigger pictures.

  • Once you've done this for all images (yes, that will take a while), always load two images into memory and subtract them from each other. That is, subtract the value of pixel (0,0) in image A from the value of pixel (0,0) in image B, then do the same for (0,1) in both, and so on. The resulting value may be positive or negative; you should always store the absolute value (so 5 results in 5, while -8 results in 8).

  • Now you have a third image, the "difference image" (delta image) of image A and B. If the two images were identical, the delta image is all black (all values subtract to zero). The "less black" it is, the less identical the images are. You need to find a good threshold: even if two images are in fact identical to your eyes, scaling, differing brightness and so on mean the delta image will not be totally black, it will just have only very dark greytones. So you need a threshold that says "if the average error (the delta image's brightness) is below a certain value, there is a good chance the images are identical; if it is above that value, they are most likely not." Finding the right threshold is as hard as finding the right scaling size. You will always have false positives (images deemed identical though they are not) and false negatives (images deemed different although they are identical). A rough sketch of the whole procedure follows this list.
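
Here is that sketch, using PHP's GD extension; the 64x64 size and the threshold of 25 are arbitrary starting points you would need to tune, as described above:

    <?php
    // Greyscale, fixed-size thumbnail of an image.
    function grey_thumb(string $path, int $size = 64): GdImage
    {
        $img = imagescale(imagecreatefromstring(file_get_contents($path)), $size, $size);
        imagepalettetotruecolor($img);
        imagefilter($img, IMG_FILTER_GRAYSCALE);
        return $img;
    }

    // Average brightness of the "delta image": mean absolute per-pixel difference.
    function mean_delta(GdImage $a, GdImage $b, int $size = 64): float
    {
        $sum = 0;
        for ($y = 0; $y < $size; $y++) {
            for ($x = 0; $x < $size; $x++) {
                // After greyscaling R == G == B, so one channel is enough.
                $sum += abs((imagecolorat($a, $x, $y) & 0xFF)
                          - (imagecolorat($b, $x, $y) & 0xFF));
            }
        }
        return $sum / ($size * $size);
    }

    $delta = mean_delta(grey_thumb('a.jpg'), grey_thumb('b.jpg'));
    $probablyDuplicates = $delta < 25;   // threshold needs tuning, as described above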

This algorithm is ultra slow. Just creating the greyscale images takes tons of time, and then you need to compare each greyscale image against every other one, which again takes tons of time. Storing all the greyscale images also takes a lot of disk space. So this algorithm is very inefficient, but the results are not that bad considering how simple it is; they are better than I had initially expected.

The only way to get even better results is to use advanced image processing, and this is where it starts getting really complicated. It involves a lot of math (a real lot of it). There are good applications (dupe finders) for many systems that have this implemented, so unless you must program it yourself, you are probably better off using one of those solutions. I have read a lot of papers on this topic, but I'm afraid most of it goes beyond my horizon. Even the algorithms I might be able to implement according to these papers are beyond it; that means I understand what needs to be done, but I have no idea why or how it actually works, it's just magic ;-)

answered by Mecki