Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Probability of getting a duplicate value when calling GetHashCode() on strings

Tags:

I want to know the probability of getting duplicate values when calling the GetHashCode() method on string instances. For instance, according to this blog post, blair and brainlessness have the same hashcode (1758039503) on an x86 machine.

like image 299
Diego Avatar asked Nov 01 '11 15:11

Diego


People also ask

Is string GetHashCode unique?

If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code. The hash code itself is not guaranteed to be stable.

What is the purpose of GetHashCode?

The GetHashCode method provides this hash code for algorithms that need quick checks of object equality. For information about how hash codes are used in hash tables and for some additional hash code algorithms, see the Hash Function entry in Wikipedia. Two objects that are equal return hash codes that are equal.

Is GetHashCode unique C#?

NO! A hash code is not an id, and it doesn't return a unique value. This is kind of obvious, when you think about it: GetHashCode returns an Int32 , which has “only” about 4.2 billion possible values, and there's potentially an infinity of different objects, so some of them are bound to have the same hash code.


2 Answers

Large.

(Sorry Jon!)

The probability of getting a hash collision among short strings is extremely large. Given a set of only ten thousand distinct short strings drawn from common words, the probability of there being at least one collision in the set is approximately 1%. If you have eighty thousand strings, the probability of there being at least one collision is over 50%.

For a graph showing the relationship between set size and probability of collision, see my article on the subject:

https://docs.microsoft.com/en-us/archive/blogs/ericlippert/socks-birthdays-and-hash-collisions

like image 188
Eric Lippert Avatar answered Sep 19 '22 15:09

Eric Lippert


Small - if you're talking about the chance of any two arbitrary unequal strings having a collision. (It will depend on just how "arbitrary" the strings are, of course - different contexts will be using different strings.)

Large - if you're talking about the chance of there being at least one collision in a large pool of arbitrary strings. The small individual probabilities are no match for the birthday problem.

That's about all you need to know. There are definitely cases where there will be collisions, and there have to be given that there are only 232 possible hash codes, and more than that many strings - so the pigeonhole principle proves that at least one hash code must have more than one string which generates it. However, you should trust that the hash has been designed to be pretty reasonable.

You can rely on it as a pretty good way of narrowing down the possible matches for a particular string. It would be an unusual set of naturally-occurring strings which generated a lot of collisions - and even when there are some collisions, obviously if you can narrow a candidate search set down from 50K to fewer than 10 strings, that's a pretty big win. But you must not rely on it as a unique value for any string.

Note that the algorithm used in .NET 4 differs between x86 and x64, so that example probably isn't valid on both platforms.

like image 38
Jon Skeet Avatar answered Sep 18 '22 15:09

Jon Skeet