Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Uniquely identifying URLs with one 64-bit number

This is basically a math problem, but very programing related: if I have 1 billion strings containing URLs, and I take the first 64 bits of the MD5 hash of each of them, what kind of collision frequency should I expect?

How does the answer change if I only have 100 million URLs?

It seems to me that collisions will be extremely rare, but these things tend to be confusing.

Would I be better off using something other than MD5? Mind you, I'm not looking for security, just a good fast hash function. Also, native support in MySQL is nice.

EDIT: not quite a duplicate

like image 449
itsadok Avatar asked Jul 08 '09 07:07

itsadok


2 Answers

If the first 64 bits of the MD5 constituted a hash with ideal distribution, the birthday paradox would still mean you'd get collisions for every 2^32 URL's. In other words, the probability of a collision is the number of URL's divided by 4,294,967,296. See http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem for details.

I wouldn't feel comfortable just throwing away half the bits in MD5; it would be better to XOR the high and low 64-bit words to give them a chance to mix. Then again, MD5 is by no means fast or secure, so I wouldn't bother with it at all. If you want blinding speed with good distribution, but no pretence of security, you could try the 64-bit versions of MurmurHash. See http://en.wikipedia.org/wiki/MurmurHash for details and code.

like image 172
Steven Sudit Avatar answered Sep 19 '22 02:09

Steven Sudit


You have tagged this as "birthday-paradox", I think you know the answer already.

P(Collision) = 1 - (2^64)!/((2^64)^n (1 - n)!)

where n is 1 billion in your case.

You will be a bit better using something other then MD5, because MD5 have pratical collusion problem.

like image 23
J-16 SDiZ Avatar answered Sep 19 '22 02:09

J-16 SDiZ