I have a unique situation where I need to produce hashes on the fly. Here is my situation. This question is related to here. I need to store a many urls in the database which need to be indexed. A URL can be over 2000 characters long. The database complains that a string over 900 bytes cannot be indexed. My solution is to hash the URL using MD5 or SHA256. I am not sure which hashing algorithm to use. Here are my requirements
I am not worried about security. I am worried about character length, speed, and collisions. Anyone know of a good algorithm for this?
In your case, I wouldn't use any of the cryptographic hash functions (i.e. MD5, SHA), since they were designed with security in mind: They mainly want to make it as hard as possible to finde two different strings with the same hash. I think this wouldn't be a problem in your case. (the possibility of random collisions is inherent to hashing, of course)
I'd strongly not suggest to use String.GetHashCode()
, since the implementation is not known and MSDN says that it might vary between different versions of the framework. Even the results between x86 and x64 versions may be different. So you'll get into troubles when trying to access the same database using a newer (or different) version of the .NET framework.
I found the algorithm for the Java implementation of hashCode
on Wikipedia (here), it seems quite easy to implement. Even a straightforward implementation would be faster than an implementation of MD5 or SHA imo. You could also use long
values which reduces the probability of collisions.
There is also a short analysis of the .NET GetHashCode
implementation here (not the algorithm itself but some implementation details), you could also use this one I guess. (or try to implement the Java version in a similar way ...)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With