I'm writing a tool in C# to find duplicate images. Currently I create an MD5 checksum of the files and compare those.
Unfortunately, the images can be:
What would be the best approach to solve this problem?
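Here is a simple approach with a 256-bit image hash (MD5 has 128 bits): resize the picture to 16x16 pixels, reduce each pixel to true/false by its brightness, and read the values into a List<bool> - this is the hash.
Code: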
    public static List<bool> GetHash(Bitmap bmpSource)
    {
        List<bool> lResult = new List<bool>();
        //create new image with 16x16 pixel
        Bitmap bmpMin = new Bitmap(bmpSource, new Size(16, 16));
        for (int j = 0; j < bmpMin.Height; j++)
        {
            for (int i = 0; i < bmpMin.Width; i++)
            {
                //reduce colors to true / false
                lResult.Add(bmpMin.GetPixel(i, j).GetBrightness() < 0.5f);
            }
        }
        return lResult;
    }
I know, GetPixel is not that fast, but on a 16x16 pixel image it should not be the bottleneck.
Code:
    List<bool> iHash1 = GetHash(new Bitmap(@"C:\mykoala1.jpg"));
    List<bool> iHash2 = GetHash(new Bitmap(@"C:\mykoala2.jpg"));

    //determine the number of equal pixels (x of 256); Zip and Count need System.Linq
    int equalElements = iHash1.Zip(iHash2, (i, j) => i == j).Count(eq => eq);
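The count of equal elements still has to be turned into a decision. A minimal sketch of one way to do that, assuming a similarity ratio and a cut-off; the HashComparison class, its method names and the 0.95 default are hypothetical and not part of the original answer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HashComparison
{
    // Fraction of positions (0.0 to 1.0) at which the two hashes agree.
    public static double Similarity(IReadOnlyList<bool> a, IReadOnlyList<bool> b)
    {
        if (a.Count != b.Count || a.Count == 0)
            throw new ArgumentException("Hashes must be non-empty and of equal length.");

        int equal = a.Zip(b, (x, y) => x == y).Count(eq => eq);
        return (double)equal / a.Count;
    }

    // The 0.95 default is an illustrative assumption; tune it on your own images.
    public static bool LooksLikeDuplicate(
        IReadOnlyList<bool> a, IReadOnlyList<bool> b, double threshold = 0.95)
    {
        return Similarity(a, b) >= threshold;
    }
}
```

With the hashes from above this would be called as HashComparison.LooksLikeDuplicate(iHash1, iHash2). A ratio rather than a raw count keeps the cut-off meaningful if the hash size is changed later.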
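So this code is able to find equal images even when the files themselves differ. Rotated or flipped copies can be matched as well by changing the iteration order of i and j.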
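Changing the iteration order of i and j does not require reading the image again; the same reordering can be applied to an existing hash. A minimal sketch, assuming the row-major 16x16 layout that GetHash produces; the HashRotation class and Rotate90 method are hypothetical names, and the result only approximates the hash of a truly rotated file because resizing may interpolate slightly differently:

```csharp
using System;
using System.Collections.Generic;

public static class HashRotation
{
    // Returns the hash as it would look after rotating the 16x16 image
    // by 90 degrees clockwise. Assumes index = y * size + x (row-major).
    public static List<bool> Rotate90(IReadOnlyList<bool> hash, int size = 16)
    {
        if (hash.Count != size * size)
            throw new ArgumentException("Hash length must be size * size.");

        var rotated = new List<bool>(hash.Count);
        for (int y = 0; y < size; y++)
        {
            for (int x = 0; x < size; x++)
            {
                // The pixel at (x, y) after rotation was at (y, size - 1 - x) before.
                rotated.Add(hash[(size - 1 - x) * size + y]);
            }
        }
        return rotated;
    }
}
```

Comparing one hash against the other hash and its three successive rotations, and keeping the best match, would cover 90, 180 and 270 degree turns.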
Update / Improvements:
After using this method for a while, I noticed a few improvements that can be made:
- Replace GetPixel for more performance.
- Instead of the fixed 0.5f threshold to differ between light and dark, use the median brightness of all 256 pixels. Otherwise uniformly dark or light images are assumed to be the same; the median also makes it possible to detect images whose brightness has been changed.
- bool[] or List<bool> is fine for the hash, but if you need to store a lot of hashes and want to save memory, use a BitArray, because a Boolean isn't stored in a bit, it takes a byte!
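The median threshold and the BitArray storage can be sketched together. This is a minimal illustration, assuming the 256 brightness values have already been collected into a float[] (for example from GetBrightness); the HashImprovements class and its method names are hypothetical:

```csharp
using System;
using System.Collections;
using System.Linq;

public static class HashImprovements
{
    // Thresholds each brightness value against the median of all values
    // instead of the fixed 0.5f.
    public static bool[] HashFromBrightness(float[] brightness)
    {
        if (brightness.Length == 0)
            throw new ArgumentException("No brightness values given.");

        float[] sorted = brightness.OrderBy(v => v).ToArray();
        int mid = sorted.Length / 2;
        // For an even count, average the two middle values.
        float median = sorted.Length % 2 == 0
            ? (sorted[mid - 1] + sorted[mid]) / 2f
            : sorted[mid];

        return brightness.Select(v => v < median).ToArray();
    }

    // Packs the hash into a BitArray, which stores one bit per value.
    public static BitArray Pack(bool[] hash)
    {
        return new BitArray(hash);
    }

    // Counts equal positions between two packed hashes of the same length.
    public static int EqualBits(BitArray a, BitArray b)
    {
        // Xor modifies the instance it is called on, so work on a copy.
        BitArray diff = new BitArray(a).Xor(b);
        int differing = 0;
        foreach (bool bit in diff)
        {
            if (bit) differing++;
        }
        return a.Length - differing;
    }
}
```

With a fixed 0.5f threshold, an image whose pixels are all darker than 0.5 hashes to all true; with the median, roughly half of the bits are set regardless of overall brightness, so a uniformly brightened copy keeps the same pattern.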