Are hash collisions with different file sizes just as likely as same file size?

I'm hashing a large number of files, and to avoid hash collisions I'm also storing each file's original size - that way, even if there's a hash collision, it's extremely unlikely that the file sizes will also be identical. Is this sound (i.e., a colliding file is equally likely to be of any size), or do I need another piece of information (i.e., a colliding file is more likely to also be the same length as the original)?

Or, more generally: Is every file just as likely to produce a particular hash, regardless of original file size?
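For concreteness, here is a minimal sketch of the scheme being described, using Python's hashlib (the choice of MD5 and the dictionary-based index are my own illustrative assumptions, not part of the question):

```python
import hashlib
import os

def file_key(path, chunk_size=65536):
    """Return (size, digest) for a file: the combined dedup key described above."""
    h = hashlib.md5()  # assumed hash; any hashlib algorithm would work
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return (os.path.getsize(path), h.hexdigest())

seen = {}

def is_duplicate(path):
    """Treat two files as identical only if BOTH size and hash match."""
    key = file_key(path)
    if key in seen:
        return True
    seen[key] = path
    return False
```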

Asked by SqlRyan on Mar 14 '10

People also ask

Can 2 different files have the same hash?

Is it possible? Yes. There are an infinite number of binary files, but only a finite number of MD5 hashes (since they have a fixed size), so there are infinitely many files that share the same hash. This is true for all hash functions.

What happens when you vary the size of a hash table?

Changing the size of the hash table changes the compression function (typically hash(key) mod table_size), which causes keys to be allotted to different buckets.
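A quick illustration of that (my own example, not from the answer above), using integer keys so the output is deterministic:

```python
# Same keys, two table sizes: the bucket for each key changes when the
# table is resized, because hash(k) % size is the compression step.
keys = [12, 29, 37]  # hash(int) == int for small ints in CPython
for size in (8, 16):
    print(size, {k: hash(k) % size for k in keys})
# size 8:  {12: 4, 29: 5, 37: 5}   <- 29 and 37 collide
# size 16: {12: 12, 29: 13, 37: 5} <- after resizing, they no longer do
```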

How can you reduce the chance of a hash collision?

If two records are directed to the same cell, both go into that cell as a linked list. This technique, called separate chaining, doesn't reduce the chance of a hash collision occurring, but it handles collisions gracefully, since records with the same hash value can coexist in the same cell - though it has its disadvantages.
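A minimal sketch of that separate-chaining idea (illustrative only, not a production hash table):

```python
class ChainedHashTable:
    """Hash table using separate chaining: colliding keys share a bucket."""

    def __init__(self, size=16):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # collision or new key: chain it

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        raise KeyError(key)
```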

Why are the hashes of the same file different?

The simplest way to think of a hash is as a (nearly) unique fingerprint of the binary data held within the file. The hash for a given file won't change if you change the name of the file (or any of its metadata), but it will be completely different if you change so much as one byte inside the file itself.
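A quick way to see this behavior (a hypothetical demo with in-memory data standing in for file contents):

```python
import hashlib

data = b"hello world"
print(hashlib.md5(data).hexdigest())      # fingerprint of the content

# Renaming a file changes none of its bytes, so its hash is unchanged.
# Flipping a single byte produces a completely different digest:
tampered = b"hello worlc"
print(hashlib.md5(tampered).hexdigest())
```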


1 Answer

Hash functions are generally written to evenly distribute the data across all result buckets.

If you assume that your files are evenly distributed over a fixed range of available sizes (let's say there are only 1024, i.e. 2^10, evenly distributed distinct sizes), then storing the file size at best reduces the chance of a collision by a factor of the number of distinct file sizes.

Note: we could assume 2^32 evenly distributed, distinct sizes instead and it still wouldn't change the rest of the math.

It is commonly accepted that the general probability of a collision between two inputs under MD5 (for example) is 1/(2^128), since MD5 produces a 128-bit digest.

Unless something is specifically built into a hash function that says otherwise, the probability of a collision between two related inputs is the same as between two random ones. That is, given any valid X, the probability P(MD5(X) == MD5(X+1)) is the same as for any two random values {Y, Z}: P(MD5(Y) == MD5(Z)) = P(MD5(X) == MD5(X+1)) = 1/(2^128) for any values of X, Y and Z.

Combining this with the 2^10 distinct file sizes means that by storing the file size you are at most gaining an additional 10 bits that signify whether items are different (again, this assumes your files are evenly distributed across all size values).
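To put numbers on that, here is a back-of-the-envelope birthday-bound calculation (my own illustration of the point above, assuming ideal hash functions; the file count is made up):

```python
# Approximate birthday bound: P(collision) ~= n^2 / (2 * 2^bits)
# for n items hashed into a 'bits'-bit space.
def collision_prob(n, bits):
    return n * n / (2 * 2**bits)

n = 10**9  # a billion files
print(collision_prob(n, 128))       # MD5 alone
print(collision_prob(n, 128 + 10))  # MD5 plus 2^10 distinct sizes
print(collision_prob(n, 160))       # SHA-1 instead
# The 10 extra "size bits" only divide the probability by ~1024, while
# a longer hash lowers it by many orders of magnitude.
```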

So at the very best, all you are doing is adding another N bytes of storage for at most N bytes' worth of distinguishing information (it can never be more than N). You would therefore be much better off increasing the number of bits returned by your hash function, using something such as SHA-1/2 instead, as that is more likely to give you an evenly distributed set of hash values than storing the file size.

In short: if MD5 isn't good enough against collisions, use a stronger hash. If the stronger hashes are too slow, then use a fast hash with a low chance of collisions (such as MD5) as a first pass, followed by a slower hash (such as SHA-1 or SHA-256) to confirm matches and reduce the chance of a collision. And if SHA-256 is fast enough and the doubled space isn't a problem, you should probably just use SHA-256.
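That tiered strategy might look like the following sketch (the two-pass structure is my reading of the advice above; the function names are made up):

```python
import hashlib

def digest(path, algo):
    """Hash a file incrementally with the given hashlib algorithm."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

by_fast_hash = {}  # fast-hash digest -> path of the first file seen with it

def is_duplicate(path):
    fast = digest(path, "md5")  # cheap first pass over every file
    if fast not in by_fast_hash:
        by_fast_hash[fast] = path
        return False
    # Fast hashes matched; confirm with a stronger hash before declaring
    # the files identical. (A real implementation would cache this digest
    # rather than recomputing it for the stored file each time.)
    other = by_fast_hash[fast]
    return digest(path, "sha256") == digest(other, "sha256")
```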

Answered by Seph on Oct 09 '22