Possible Duplicate:
How do I create a HashCode in .net (c#) for a string that is safe to store in a database?
I'm planning to store hundreds of thousands of URLs in my database. Each row in my UrlInfo table will be immutable, with the URL itself being the logical primary key. Since URLs can be fairly lengthy, I've decided to hash the URL as a quick way to find possible matches when adding new rows. The hash isn't my true key, just a way to find possible matches quickly. In addition, I'm using a RegEx pattern per domain that distills the essence of the URL into something that can be compared against other URLs. I'm also storing the results of the RegEx as a hash, and I'm not concerned if it yields possible duplicates.
Everything was going great until I learned that C#'s string.GetHashCode() method, which I've been using to hash things, is not guaranteed to return the same value for the same string across versions of .NET. I noticed this when I tried moving my hash function from ASP.NET to SQL Server CLR code. The web app uses .NET 4.0 and SQL Server 2008 R2, I've learned, uses .NET 3.5. They yielded different hash results for the same string, so now I need to get away from string.GetHashCode() because I don't want to have to worry about this changing when I upgrade my app to future versions of .NET.
So, questions:
Does my architecture smell since I'm storing a hash in my db? Any better ways? Obviously Microsoft doesn't want me to store hash results!
Can someone recommend a good C# replacement algorithm for hashing strings? I saw Jon's here but not exactly sure how to modify to work for strings (loop through each char using ascii codes?).
Are there any string compression algorithms that would be better than using a hash algorithm?
Thanks
AWESOME RESPONSES BY MANY. THANK YOU VERY MUCH!!!
You could always use an MD5 hash instead, which is relatively fast:
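using System.Security.Cryptography;
using System.Text;

public string GetUrlHash(string url) {
    // MD5 is IDisposable, so release it once the digest is computed.
    using (MD5 md5 = MD5.Create()) {
        byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
        // Render the 16 digest bytes as 32 uppercase hex characters.
        StringBuilder sb = new StringBuilder(hash.Length * 2);
        for (int i = 0; i < hash.Length; i++) {
            sb.Append(hash[i].ToString("X2"));
        }
        return sb.ToString();
    }
}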
Call it like so:
Console.WriteLine(this.GetUrlHash("http://stackoverflow.com/questions/5355003/storing-c-gethashcode-in-db-is-unreliable"));
And get:
> 777BED7F83C66DAC111977067B4B4385
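This should be fairly reliable from a uniqueness standpoint, and unlike string.GetHashCode() the result does not depend on the .NET version, since MD5 is a fixed, published algorithm. MD5 is no longer considered secure for passwords or other security-sensitive uses, but you don't have that problem here.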
The only catch is that using a 32-character string like this as a primary key on a table might hurt performance.
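If the string key turns out to be too heavy, one option is to keep only part of the digest as an integer. The following is a minimal sketch, not a tested production routine: it folds the first 8 bytes of the MD5 digest into a long that fits a BIGINT column. The names UrlHashSketch and GetUrlHash64 are made up for illustration, and truncating to 64 bits makes collisions more likely than with the full digest, so it only suits the "find possible matches" use described in the question.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class UrlHashSketch
{
    // Hypothetical helper: returns the first 8 bytes of the MD5 digest
    // as a 64-bit integer. Bytes are combined in a fixed (big-endian)
    // order so the result does not depend on the machine's endianness.
    public static long GetUrlHash64(string url)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
            long value = 0;
            for (int i = 0; i < 8; i++)
            {
                value = (value << 8) | digest[i];
            }
            return value; // may be negative; BIGINT is signed too
        }
    }
}
```

Because this is only a lookup aid, rows that share the same value still need to be compared on the full URL.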
The other thing you could do is use the URL shortener approach: use your database's sequence generation feature, and convert the value (make sure you use the equivalent of LONG or BIGINT!) to something like Base36, which gives you a nice, concise string.
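Here is a rough sketch of the Base36 conversion, assuming the sequence value comes back from the database as a non-negative long. The class name Base36Sketch and the lowercase alphabet are my own choices for illustration, not part of any built-in .NET API.

```csharp
using System;
using System.Text;

public static class Base36Sketch
{
    // Digits 0-9 followed by a-z: 36 symbols in total.
    private const string Alphabet = "0123456789abcdefghijklmnopqrstuvwxyz";

    // Encodes a non-negative sequence value as a Base36 string.
    public static string Encode(long id)
    {
        if (id < 0) throw new ArgumentOutOfRangeException("id");
        if (id == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (id > 0)
        {
            // Prepend the least significant digit each time round.
            sb.Insert(0, Alphabet[(int)(id % 36)]);
            id /= 36;
        }
        return sb.ToString();
    }

    // Decodes a Base36 string back to the original value.
    public static long Decode(string text)
    {
        long value = 0;
        foreach (char c in text.ToLowerInvariant())
        {
            int digit = Alphabet.IndexOf(c);
            if (digit < 0) throw new FormatException("Not a Base36 digit: " + c);
            value = checked(value * 36 + digit); // throws on overflow
        }
        return value;
    }
}
```

With this, Base36Sketch.Encode(35) gives "z" and Base36Sketch.Encode(36) gives "10". Even the largest long value encodes to 13 characters, so the key stays short.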