Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

To check if two image files are same..Checksum or Hash?

I am doing some image processing code where in I download some images(as BufferedImage) from URLs and pass it on to a image processor.

I want to avoid passing of the same image more than once to the image processor(as the image processing operation is of high cost). The URL end points of the images(if they are same images) may vary and hence I can prevent this by the URL. So I was planning to do a checksum or hash to identify if the code is encountering the same image again.

For md5 I tried Fast MD5, and it generated a 20K+ character length hex checksum value for the image(some sample). Obviously storing this 20K+ character hash would be an issue when it comes to database storage. Hence I tried the CRC32(from java.util.zip.CRC32). And it did generate quite smaller length check sum than the hash.

I do understand checksum and hash are for different purposes. For the purpose explained above can I just use the CRC32? Would it solve the purpose or I have to try something more than these two?

Thanks, Abi

like image 477
Abhishek Avatar asked Jun 17 '11 06:06

Abhishek


2 Answers

The difference between CRC and, say, MD5, is that it is more difficult to tamper a file to match a "target" MD5 than to tamper it to match a "target" checksum. Since this does not seem a problem for your program, it should not matter which method do you use. Maybe MD5 might be a little more CPU intensive, but I do not know if that different will matter.

The main question should be the number of bytes of the digest.

If you are doing a checksum in an integer will mean that, for a file of 2K size, you are fitting 2^2048 combinations into 2^32 combinations --> for every CRC value, you will have 2^64 possible files that match it. If you have a 128 bits MD5, then you have 2^16 possible collisions.

The bigger the code that you compute, the less possible collisions (given that the codes computed are distributed evenly), so the safer the comparation.

Anyway, in order to minimice possible errors, I think the first classification should be using file size... first compare file sizes, if they match then compare checksums/hash.

like image 95
SJuan76 Avatar answered Sep 17 '22 13:09

SJuan76


A checksum and a hash are basically the same. You should be able to calculate any kind of hash. A regular MD5 would normally suffice. If you like, you could store the size and the md5 hash (which is 16 bytes, I think).

If two files have different sizes, thay are different files. You will not even need to calculate a hash over the data. If it is unlikely that you have many duplicate files, and the files are of the larger kind (like, JPG pictures taken with a camera), this optimization may spare you a lot of time.

If two or more files have the same size, you can calculate the hashes and compare them.

If two hashes are the same, you could compare the actual data to see if this is different after all. This is very, very unlikely, but theoretically possible. The larger your hash (md5 is 16 bytes, while CR32 is only 4), the less likely that two different files will have the same hash. It will take only 10 minutes of programming to perform this extra check though, so I'd say: better safe than sorry. :)

To further optimize this, if exactly two files have the same size, you can just compare their data. You will need to read the files anyway to calculate their hashes, so why not compare them directly if they are the only two with that specific size.

like image 38
GolezTrol Avatar answered Sep 21 '22 13:09

GolezTrol