Up to what string length is it possible to use MD5 as a hash without having to worry about the possibility of a collision?
This would presumably be calculated by generating an MD5 hash for every possible string in a particular character set, in increasing length, until a hash appears for a second time (a collision). The maximum possible length of a string without a collision would then be one character less than the longest of the colliding pair.
Has this already been tested for MD5, SHA1, etc?
MD5: The fastest and shortest generated hash (16 bytes). The probability of just two hashes accidentally colliding is approximately: 1.47*10-29.
A collision is when you find two files to have the same hash. The research published by Wang, Feng, Lai and Yu demonstrated that MD5 fails this third requirement since they were able to generate two different messages that have the same hash.
Overview of security issues In 2004 it was shown that MD5 is not collision-resistant. As such, MD5 is not suitable for applications like SSL certificates or digital signatures that rely on this property for digital security.
Yes, MD5 always outputs the same given the same input. That's how it's used for passwords. You store the hash in the database, then when the user types their password in, it's hashed again and the two hashes are compared. NOTE: MD5 is not recommended for hashing passwords because it's cryptographically weak.
Ironically, a few weeks after I posted the previous answer, two Chinese researchers, Tao Xie and Dengguo Feng, published a new single-block collision for MD5. I was unaware of that paper until now. A single MD5 block means that the input size is 64 bytes or 512 bits. Note that the inputs are mostly the same, differing only in 2 bits.
Their methodology won't be published until January 2013, but their collision can be verified now, using numbers from the paper:
>>> from array import array >>> from hashlib import md5 >>> input1 = array('I',  [0x6165300e,0x87a79a55,0xf7c60bd0,0x34febd0b,0x6503cf04,     0x854f709e,0xfb0fc034,0x874c9c65,0x2f94cc40,0x15a12deb,0x5c15f4a3,0x490786bb,     0x6d658673,0xa4341f7d,0x8fd75920,0xefd18d5a]) >>> input2 = array('I', [x^y for x,y in zip(input1,     [0, 0, 0, 0, 0, 1<<10, 0, 0, 0, 0, 1<<31, 0, 0, 0, 0, 0])]) >>> input1 == input2 False >>> md5(input1).hexdigest() 'cee9a457e790cf20d4bdaa6d69f01e41' >>> md5(input2).hexdigest() 'cee9a457e790cf20d4bdaa6d69f01e41' Update: The paper has been published in March 2013: Tao Xie and Fanbao Liu and Dengguo Feng - Fast Collision Attack on MD5
However, if you have more room to play with, collisions of a few kilobytes are MUCH faster to calculate -- they can be calculated within hours on ANY regular computer.
The previous shortest collision used at least two MD5 blocks worth of input -- that's 128 bytes, 1024 bits. A prefix in the first block can be chosen arbitrarily by the attacker, the rest would be computed and appear as gibberish.
Here's an example of two different colliding inputs, you can try it yourself in Python:
>>> from binascii import unhexlify >>> from hashlib import md5 >>> input1 = 'Oded Goldreich\nOded Goldreich\nOded Goldreich\nOded Go' + unhexlify( ... 'd8050d0019bb9318924caa96dce35cb835b349e144e98c50c22cf461244a4064bf1afaecc582' ... '0d428ad38d6bec89a5ad51e29063dd79b16cf67c12978647f5af123de3acf844085cd025b956') >>> len(input1) 128 >>> md5(input1).hexdigest() 'd320b6433d8ebc1ac65711705721c2e1' >>> input2 = 'Neal Koblitz\nNeal Koblitz\nNeal Koblitz\nNeal Koblitz\n' + unhexlify( ... '75b80e0035f3d2c909af1baddce35cb835b349e144e88c50c22cf461244a40e4bf1afaecc582' ... '0d428ad38d6bec89a5ad51e29063dd79b16cf6fc11978647f5af123de3acf84408dcd025b956') >>> md5(input2).hexdigest() 'd320b6433d8ebc1ac65711705721c2e1' Generating these two particular inputs took 2 days on a 215-node Playstation 3 cluster, by Mark Stevens :)
The mathematics of the birthday paradox make the inflection point of probability of collision roughly around sqrt(N), where N is the number of distinct bins in the hash function, so for a 128-bit hash, as you get around 64 bits you are moderately likely to have 1 collision. So my guess is for the complete set of 8 byte strings it's somewhat likely to have a collision, and for 9 byte strings it's extremely likely.
edit: this assumes that the MD5 hash algorithm causes a mapping from input bytestring to output hash that is close to "random". (vs. one that distributes strings more evenly among the set of possible hashes, in which case it would be more close to 16 bytes.)
Also for a more specific numerical answer, if you look at one of the approximations for calculating collision probability, you get
p(k) ≈ 1 - e-k(k-1)/(2*2128) where k = the size of the space of possible inputs = 2m where the input bytestring is m bits long.
the set of 8 byte strings: p(264) ≈ 1 - e-0.5 ≈ 0.3935
the set of 9 byte strings: p(272) ≈ 1 - e-2144/(2*2128) = 1 - e-215 = 1 - e-32768 ≈ 1
Also note that these assume the complete set of m/8 byte strings. If you only use alphanumeric characters, you'd need more bytes to get a probable collision.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With