I'm writing a tool in C# to find duplicate images. Currently I create an MD5 checksum of the files and compare those.
Unfortunately, the images can be:


What would be the best approach to solve this problem?
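Here is a simple approach with a 256-bit image hash (MD5 has 128 bits):

- resize the picture to 16x16 pixels
- reduce the colors to true / false by brightness
- read the boolean values into a List<bool> - this is the hash

Code: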
```csharp
using System.Collections.Generic;
using System.Drawing;

public static List<bool> GetHash(Bitmap bmpSource)
{
    List<bool> lResult = new List<bool>();

    // create a new image with 16x16 pixels; dispose it when done
    using (Bitmap bmpMin = new Bitmap(bmpSource, new Size(16, 16)))
    {
        for (int j = 0; j < bmpMin.Height; j++)
        {
            for (int i = 0; i < bmpMin.Width; i++)
            {
                // reduce colors to true / false (true = darker than 0.5)
                lResult.Add(bmpMin.GetPixel(i, j).GetBrightness() < 0.5f);
            }
        }
    }
    return lResult;
}
```

I know GetPixel is not that fast, but on a 16x16 pixel image it should not be the bottleneck.
Code:
```csharp
using System.Linq; // needed for Zip and Count

List<bool> iHash1 = GetHash(new Bitmap(@"C:\mykoala1.jpg"));
List<bool> iHash2 = GetHash(new Bitmap(@"C:\mykoala2.jpg"));

// determine the number of equal pixels (x of 256)
int equalElements = iHash1.Zip(iHash2, (i, j) => i == j).Count(eq => eq);
```

So this code is able to find equal images with:
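- different file formats or dimensions, because the hash is computed from the decoded, downscaled pixels
- rotations or flips, by changing the iteration order of i and j in GetHash

Update / Improvements: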
After using this method for a while, I noticed a few improvements that can be made:
- replace GetPixel for more performance
- instead of using 0.5f to distinguish light from dark, use the median brightness of all 256 pixels. Otherwise all dark (or all light) images are assumed to be the same; the median also makes it possible to detect images whose brightness has been changed.
- bool[] or List<bool> works, but if you need to store a lot of hashes and want to save memory, use a BitArray, because a Boolean isn't stored in a bit, it takes a byte!
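The median threshold and the BitArray storage could look roughly like the sketch below. It is only an illustration, not the code from the answer: the class and method names (HashSketch, MedianHash, Pack, CountEqualBits) are made up here, and it assumes the 256 brightness values have already been read from the 16x16 image (for example with GetBrightness). It does not cover replacing GetPixel.

```csharp
using System;
using System.Collections;
using System.Linq;

// Hypothetical helper class, names are illustrative only.
public static class HashSketch
{
    // Threshold every brightness value against the median of all values
    // instead of the fixed 0.5f (true = darker than the median).
    public static bool[] MedianHash(float[] brightness)
    {
        if (brightness == null || brightness.Length == 0)
            throw new ArgumentException("At least one brightness value is required.");

        float[] sorted = (float[])brightness.Clone();
        Array.Sort(sorted);
        int mid = sorted.Length / 2;
        // For an even count, average the two middle values.
        float median = sorted.Length % 2 == 0
            ? (sorted[mid - 1] + sorted[mid]) / 2f
            : sorted[mid];

        return brightness.Select(b => b < median).ToArray();
    }

    // Pack the hash into a BitArray, which stores one bit per value.
    public static BitArray Pack(bool[] hash)
    {
        return new BitArray(hash);
    }

    // Count the positions where both hashes agree.
    // Both BitArrays must have the same length, otherwise Xor throws.
    public static int CountEqualBits(BitArray a, BitArray b)
    {
        // Xor modifies the instance it is called on, so work on a copy.
        BitArray diff = new BitArray(a).Xor(b);
        int differing = 0;
        foreach (bool bit in diff)
        {
            if (bit) differing++;
        }
        return a.Length - differing;
    }
}
```

With this, a uniformly darkened copy of an image should yield the same hash as the original, because scaling every brightness value also scales the median. How many equal bits count as a duplicate is a threshold you would have to tune on your own images.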