
How does comparing images through md5 work?

Does this method compare the pixel values of the images? I'm guessing it won't work if the images are different sizes, but what if they are identical and only differ in format? For example, I took a screenshot and saved it as a .jpg, then took another and saved it as a .gif.

TreeTree asked Jan 31 '11 16:01


4 Answers

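An MD5 hash is computed over the file's actual binary data, so the same picture saved in different formats will produce completely different hashes.

So for MD5 hashes to match, the files must be byte-for-byte identical. (The exception is a hash collision, where two different inputs produce the same hash; this is a fringe case for ordinary files, although collisions can be constructed deliberately for MD5.)

This is actually one way forensic law enforcement finds images it deems contraband: by matching file hashes against known ones.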
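The byte-for-byte matching described above can be sketched with Python's standard `hashlib` module. This is a minimal illustration, not anything from the original answer; the helper names `md5_of_bytes`, `md5_of_file`, and `same_file_contents` are made up for this example.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file's raw bytes in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a: str, path_b: str) -> bool:
    """True when both files hash to the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Note that nothing here decodes an image: the hash only ever sees the bytes stored on disk, which is why a .jpg and a .gif of the same screenshot will not match.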

jondavidjohn answered Oct 21 '22 06:10


It is an MD5 checksum, the same thing you often see when downloading a file: if the MD5 of the downloaded file matches the MD5 given by the provider, then the file transfer was successful (see http://en.wikipedia.org/wiki/Checksum). If there is even 1 bit of difference between the 2 files, the resulting hash will be completely different.

Due to the difference in encoding between a JPG and GIF, the 2 will not have the same MD5 hash.
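To illustrate the checksum behaviour described above, here is a small sketch using Python's `hashlib`. The byte string stands in for a downloaded file, and `flip_bit` and `verify_download` are hypothetical helper names chosen for this example.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    mutable = bytearray(data)
    mutable[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(mutable)


def verify_download(data: bytes, expected_md5: str) -> bool:
    """Compare the digest of the received bytes with the published checksum."""
    return hashlib.md5(data).hexdigest() == expected_md5.lower()


# Placeholder content standing in for a real downloaded file.
original = b"pretend these bytes are a downloaded file"
published = hashlib.md5(original).hexdigest()

# Simulate a transfer error that corrupts a single bit.
corrupted = flip_bit(original, 0)

print(verify_download(original, published))   # True
print(verify_download(corrupted, published))  # False
```

The same reasoning applies to the two image files: their encoded bytes differ in far more than one bit, so their hashes have nothing in common.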

Gazler answered Oct 21 '22 05:10


md5 is a hash algorithm, so it does not compare images; it works on data. The data you put in can be nearly anything, such as the contents of a file. It then outputs a hash string based on that content, which here is the raw data of the file.

So when you feed an image into md5 you are not comparing images, only the raw data of the image files. The hash algorithm knows nothing except that raw data, so the hashes of a jpg and a gif (or any other image format) of the same screenshot will never be the same.

Even if you compare the decoded images, the hashes will usually not match, because lossy compression introduces small differences that the human eye cannot see (depending on the amount of compression used). This might be different when comparing the decoded data of losslessly encoded images, but I'm not sure about that.
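The lossless case can be explored with a small experiment. The sketch below is an assumption-laden illustration rather than a general tool: it uses the PPM format (binary `P6` and plain-text `P3`) only because both variants are simple enough to encode and decode by hand, the 2x2 pixel values are made up, and the tiny decoder ignores PPM comments. Real image formats would need an image library to decode.

```python
import hashlib

# A 2x2 RGB image as a flat list of (r, g, b) tuples; the values are made up.
WIDTH, HEIGHT = 2, 2
PIXELS = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]


def encode_ppm_binary(width, height, pixels):
    """Encode as binary PPM (P6): ASCII header followed by raw RGB bytes."""
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(channel for pixel in pixels for channel in pixel)


def encode_ppm_ascii(width, height, pixels):
    """Encode as plain PPM (P3): header and samples are all ASCII text."""
    samples = " ".join(str(channel) for pixel in pixels for channel in pixel)
    return f"P3\n{width} {height}\n255\n{samples}\n".encode("ascii")


def decode_ppm(data):
    """Decode either encoding above back to raw RGB bytes (no comment support)."""
    if data.startswith(b"P6"):
        # The header is three newline-terminated lines; the rest is pixel data.
        _magic, _size, _maxval, raw = data.split(b"\n", 3)
        return raw
    if data.startswith(b"P3"):
        tokens = data.split()
        return bytes(int(token) for token in tokens[4:])  # skip P3, w, h, maxval
    raise ValueError("unsupported format")


def md5_hex(data):
    return hashlib.md5(data).hexdigest()


binary_file = encode_ppm_binary(WIDTH, HEIGHT, PIXELS)
ascii_file = encode_ppm_ascii(WIDTH, HEIGHT, PIXELS)

# The files differ byte for byte, so the file hashes differ.
file_hashes_match = md5_hex(binary_file) == md5_hex(ascii_file)

# Both encodings are lossless, so the decoded pixels (and their hashes) agree.
pixel_hashes_match = md5_hex(decode_ppm(binary_file)) == md5_hex(decode_ppm(ascii_file))

print(file_hashes_match, pixel_hashes_match)  # False True
```

This only holds when both formats preserve every pixel exactly and the decoded data is laid out the same way (same channel order and bit depth). With a lossy format such as JPEG, the decoded pixels themselves change, so the hashes diverge again.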

Take a look at the Wikipedia article on hash functions for a more detailed explanation and technical background.

Markus answered Oct 21 '22 06:10


When you look at the raw bytes, a .jpg file starts with the marker bytes FF D8 (with the text 'JFIF' appearing a few bytes later in JFIF files), while a .gif starts with 'GIF'. In other words, comparing the on-disk bytes of the "same image" in two different formats is pretty much guaranteed to produce two different MD5 hashes, since the files' contents differ - even if the actual image is the "same picture".

To do a hash-based image comparison, you have to compare two images using the same format. It would be very, very difficult to produce a .jpg and a .gif of the same image that would compare equal if you converted them both to (say) a .bmp. The results would be in the same file format, but the internal constraints of .gif (8-bit palette, lossless LZW compression) vs. those of .jpg (24-bit color, lossy discrete cosine transform compression) mean it's nigh-on impossible to get the same .bmp from both source images.
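As a rough illustration of how different the leading bytes are, the sketch below checks a file's first few bytes against well-known signatures: JPEG data begins with the bytes FF D8 FF (in JFIF files the ASCII text "JFIF" follows at offset 6), and GIF data begins with "GIF87a" or "GIF89a". The function name `sniff_image_format` and the sample headers are made up for this example, and the check covers only these two formats.

```python
def sniff_image_format(data: bytes) -> str:
    """Guess the image format from the leading signature bytes."""
    if data[:3] == b"\xff\xd8\xff":
        return "jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    return "unknown"


# Hand-written file prefixes for illustration; not complete image files.
jfif_prefix = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
gif_prefix = b"GIF89a"

print(sniff_image_format(jfif_prefix))  # jpeg
print(sniff_image_format(gif_prefix))   # gif
```

Since the two files already disagree in their first bytes, an MD5 over the whole file has no chance of matching.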

Marc B answered Oct 21 '22 06:10