Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to hash a URL quickly

I have a unique situation where I need to produce hashes on the fly. Here is my situation. This question is related to here. I need to store a many urls in the database which need to be indexed. A URL can be over 2000 characters long. The database complains that a string over 900 bytes cannot be indexed. My solution is to hash the URL using MD5 or SHA256. I am not sure which hashing algorithm to use. Here are my requirements

  • Shortest character length with minimal collision
  • Needs to be very fast. I will be hashing the referurl on every page request
  • Collisions need to be minimized since I may have millions of urls in the database

I am not worried about security. I am worried about character length, speed, and collisions. Anyone know of a good algorithm for this?

like image 437
Luke101 Avatar asked Oct 18 '11 15:10

Luke101


1 Answers

In your case, I wouldn't use any of the cryptographic hash functions (i.e. MD5, SHA), since they were designed with security in mind: They mainly want to make it as hard as possible to finde two different strings with the same hash. I think this wouldn't be a problem in your case. (the possibility of random collisions is inherent to hashing, of course)

I'd strongly not suggest to use String.GetHashCode(), since the implementation is not known and MSDN says that it might vary between different versions of the framework. Even the results between x86 and x64 versions may be different. So you'll get into troubles when trying to access the same database using a newer (or different) version of the .NET framework.

I found the algorithm for the Java implementation of hashCode on Wikipedia (here), it seems quite easy to implement. Even a straightforward implementation would be faster than an implementation of MD5 or SHA imo. You could also use long values which reduces the probability of collisions.

There is also a short analysis of the .NET GetHashCode implementation here (not the algorithm itself but some implementation details), you could also use this one I guess. (or try to implement the Java version in a similar way ...)

like image 157
MartinStettner Avatar answered Oct 14 '22 09:10

MartinStettner