Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can i use GetHashCode() for all string compares?

Tags:

c#

hashcode

i want to cache some search results based on the object to search and some search settings.

However: this creates quite a long cache key, and i thought i'd create a shortcut for it, and i thought i'd use GetHashCode() for it.

So i was wondering, does GetHashCode() always generate a different number, even when i have very long strings or differ only by this: 'ä' in stead of 'a'

I tried some strings and it seemed the answer is yes, but not understanding the GetHashCode() behaviour doesn't give me the true feeling i am right.

And because it is one of those things which will pop up when you're not prepared (the client is looking at cached results for the wrong search) i want to be sure...

EDIT: if MD5 would work, i can change my code not to use the GetHashCode ofcourse, the goals is to get a short(er) string than the original (> 1000 chars)

like image 894
Michel Avatar asked Dec 01 '22 05:12

Michel


2 Answers

You CANNOT count on GetHashCode() being unique.

There is an excellent article which investigates the likelihood of collisions available at http://kenneththorman.blogspot.com/2010/09/c-net-equals-and-gethashcode.html . The findings were that "The smallest number of calls to GetHashCode() to return the same hashcode for a different string was after 565 iterations and the highest number of iterations before getting a hashcode collision was 296390 iterations. "

So that you can understand the contract for GetHashCode implementations, the following is an excerpt from MSDN documentation for Object.GetHashCode():

A hash function must have the following properties:

  • If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.

  • The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.

  • For the best performance, a hash function must generate a random distribution for all input.

Eric Lippert of the C# compiler team explains the rationale for the GetHashCode implementation rules on his blog at http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/ .

like image 160
smartcaveman Avatar answered Dec 05 '22 15:12

smartcaveman


Logically GetHashCode cannot be unique since there are only 2^32 ints and an infinite number of strings (see the pigeon hole principle).


As @Henk pointed out in the comment even though there are an infinite number of strings there are a finite number of System.Strings. However the pigeon hole principle still stands as the later is much larger than int.MaxValue.

like image 30
Motti Avatar answered Dec 05 '22 17:12

Motti