I've got an image library on Amazon S3. For each image, I md5 the source URL on my server plus a timestamp to get a unique filename. Since S3 can't have subdirectories, I need to store all of these images in a single flat folder.
Do I need to worry about collisions in the MD5 hash value that gets produced?
Bonus: How many files could I have before I'd start seeing collisions in the hash value that MD5 produces?
MD5: The fastest and shortest generated hash (16 bytes). The probability of just two hashes accidentally colliding is approximately: 1.47*10-29.
In 2004 it was shown that MD5 is not collision-resistant. As such, MD5 is not suitable for applications like SSL certificates or digital signatures that rely on this property for digital security.
The expected number of collisions (assuming that the hash function can be modeled as a random function) is precisely 2−n(m2); that is, the expected number of pairs of values x≠y with H(x)=H(y) (and so, to answer Ricky's question, H(x)=H(y)=H(z) would count as three collisions).
For example, the MD5 hash is always 128 bits long (commonly represented as 16 hexadecimal bytes). Thus, there are 2^128 possible MD5 hashes.
Probability of just two hashes accidentally colliding is 1/2128 which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456.
However if you keep all the hashes then the probability is a bit higher thanks to birthday paradox. To have a 50% chance of any hash colliding with any other hash you need 264 hashes. This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With