I want to truncate an md5 hash to about half size. How much does that increase the odds of collisions? if I'm dealing with around 500 000 generations, should I be worried about a collision? what about 1m generations.
MD5: The fastest and shortest generated hash (16 bytes). The probability of just two hashes accidentally colliding is approximately: 1.47*10-29.
For all elements to collide, the elements should be equal to twice the total number of hash values. If we hash M values and total possible hash values is T, then the expected number of collisions will be C = M * (M-1) / 2T.
Yes. Truncating the output of a hash function always decreases its (theoretical) collision-resistance. In practice, it usually doesn't matter too much; for instance, 280 time is still pretty big. Still, if you used the full output of SHA-256, the same birthday attack would take 2128 time, which is totally out of reach.
A hash collision is created when we take two different inputs of data, and then create the same hash. One way of doing with is to search for two data elements and add random data in order to find the same hash.
The math you're looking for is on Wikipedia's birthday attack page.
We consider the following experiment. From a set of H values we choose n values uniformly at random thereby allowing repetitions. Let p(n; H) be the probability that during this experiment at least one value is chosen more than once. This probability can be approximated as
With 128 bits the chance of a collision among 500,000 hash values is around 10-28. If you halve the size of the collision space then the chance of collision is around 10-9. That is, even though the chance is vastly greater it's still very, very low. It depends on how critical it is that there be no collisions. 10-9 is on the order of one in a billion, so while extremely unlikely it's within the realm of possibility.
For reference:
1028 = 10 octillion = 10 billion billion billion
109 = 1 billion
There's an interesting mathematical problem called the birthday problem that deals with that kind of situation. The fact is that the more entries you push in, the higher the chances to have a collision.
Following the table posted on the above link, assuming your digests are 64 bits each (since a single MD5 hash is 128 bits) and that MD5 have a uniform distribution, there is a very low chance that two hashes will collide. It becomes significant (1% chance or more) at 610,000,000 entries.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With