Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generate integer based on any given string (without GetHashCode)

I'm attempting to write a method to generate an integer based on any given string. When calling this method on 2 identical strings, I need the method to generate the same exact integer both times.

I tried using .GetHasCode() however this is very unreliable once I move the project to another machine, as GetHasCode() returns different values for the same string

It is also important that the collision rate be VERY low. Custom methods I have written thus far produce collisions after just a few hundred thousand records.

The hash value MUST be an integer. A string hash value (like md5) would cripple my project in terms of speed and loading overhead.

The integer hashes are being used to perform extremely rapid text searches, which I have working beautifully, however it currently relies on .GetHasCode() and doesn't work when multiple machines get involved.

Any insight at all would be greatly appreciated.

like image 231
mrb398 Avatar asked Nov 11 '14 17:11

mrb398


People also ask

What is GetHashCode in powershell?

GetHashCode(ReadOnlySpan<Char>) Returns the hash code for the provided read-only character span.

What is the return type of the string GetHashCode () in C #?

GetHashCode() method is used to get the hash code of the specified string. When you apply this method to the string this method will return a 32-bit signed integer hash code of the given string.

How does GetHashCode work in C#?

GetHashCode method of the base class uses reflection to compute the hash code based on the values of the type's fields. In other words, value types whose fields have equal values have equal hash codes.

Is Hashcode unique in C#?

A hash code is not an id, and it doesn't return a unique value. This is kind of obvious, when you think about it: GetHashCode returns an Int32 , which has “only” about 4.2 billion possible values, and there's potentially an infinity of different objects, so some of them are bound to have the same hash code.


3 Answers

MD5 hashing returns a byte array which could be converted to an integer:

var mystring = "abcd";
MD5 md5Hasher = MD5.Create();
var hashed = md5Hasher.ComputeHash(Encoding.UTF8.GetBytes(mystring));
var ivalue = BitConverter.ToInt32(hashed, 0);

Of course, you are converting from a 128 bit hash to a 32 bit int, so some information is being lost which will increase the possibility of collisions. You could try adjusting the second parameter to ToInt32 to see if any specific ranges of the MD5 hash produce fewer collisions than others for your data.

like image 93
Rudism Avatar answered Oct 24 '22 02:10

Rudism


If your hash code creates duplicates "after a few hundred thousand records," you have a pretty good hash code implementation.

If you do the math, you'll find that a 32-bit hash code has a 50% chance of creating a duplicate after about 70,000 records. The probability of generating a duplicate after a million records is so close to certainty as not to matter.

As a rule of thumb, the likelihood of generating a duplicate hash code is 50% when the number of records hashed is equal to the square root of the number of possible values. So with a 32 bit hash code that has 2^32 possible values, the chance of generating a duplicate is 50% after approximately 2^16 (65,536) values. The actual number is slightly larger--closer to 70,000--but the rule of thumb gets you in the ballpark.

Another rule of thumb is that the chance of generating a duplicate is nearly 100% when the number of items hashed is four times the square root. So with a 32-bit hash code you're almost guaranteed to get a collision after only 2^18 (262,144) records hashed.

That's not going to change if you use the MD5 and convert it from 128 bits to 32 bits.

like image 23
Jim Mischel Avatar answered Oct 24 '22 01:10

Jim Mischel


This code map any string to int between 0-100

int x= "ali".ToCharArray().Sum(x => x)%100;
like image 2
Roohi Ali Avatar answered Oct 24 '22 01:10

Roohi Ali