Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How is GetHashCode() of C# string implemented?

I'm just curious because I guess it will have impact on performance. Does it consider the full string? If yes, it will be slow on long string. If it only consider part of the string, it will have bad performance (e.g. if it only consider the beginning of the string, it will have bad performance if a HashSet contains mostly strings with the same.

like image 752
Louis Rhys Avatar asked Mar 02 '13 12:03

Louis Rhys


People also ask

How is GetHashCode used?

GetHashCode() is used to help support using the object as a key for hash tables. (A similar thing exists in Java etc). The goal is for every object to return a distinct hash code, but this often can't be absolutely guaranteed. It is required though that two logically equal objects return the same hash code.

Why do we need GetHashCode C#?

GetHashCode returns a value based on the current instance that is suited for hashing algorithms and data structures such as a hash table. Two objects that are the same type and are equal must return the same hash code to ensure that instances of System.

What is return type of GetHashCode C#?

GetHashCode() Method is used to return the hash code for this instance. Syntax: public override int GetHashCode (); Return Value: This method returns the hash code for the current instance.

Should I override GetHashCode?

Why is it important to override GetHashCode ? It s important to implement both equals and gethashcode, due to collisions, in particular while using dictionaries. if two object returns same hashcode, they are inserted in the dictionary with chaining. While accessing the item equals method is used.


1 Answers

Be sure to obtain the Reference Source source code when you have questions like this. There's a lot more to it than what you can see from a decompiler. Pick the one that matches your preferred .NET target, the method has changed a great deal between versions. I'll just reproduce the .NET 4.5 version of it here, retrieved from Source.NET 4.5\4.6.0.0\net\clr\src\BCL\System\String.cs\604718\String.cs

        public override int GetHashCode() {   #if FEATURE_RANDOMIZED_STRING_HASHING             if(HashHelpers.s_UseRandomizedStringHashing)             {                  return InternalMarvin32HashString(this, this.Length, 0);             }  #endif // FEATURE_RANDOMIZED_STRING_HASHING               unsafe {                  fixed (char *src = this) {                     Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");                     Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");  #if WIN32                     int hash1 = (5381<<16) + 5381;  #else                      int hash1 = 5381; #endif                      int hash2 = hash1;  #if WIN32                     // 32 bit machines.                      int* pint = (int *)src;                     int len = this.Length;                      while (len > 2)                      {                         hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];                          hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];                         pint += 2;                         len  -= 4;                     }                       if (len > 0)                      {                          hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];                     }  #else                     int     c;                     char *s = src;                     while ((c = s[0]) != 0) {                          hash1 = ((hash1 << 5) + hash1) ^ c;                         c = s[1];                          if (c == 0)                              break;                         hash2 = ((hash2 << 5) + hash2) ^ c;                          s += 2;                     } #endif #if DEBUG                      // We want to ensure we can change our hash function daily.                     // This is perfectly fine as long as you don't persist the                      // value from GetHashCode to disk or count on String A                      // hashing before string B.  Those are bugs in your code.                     hash1 ^= ThisAssembly.DailyBuildNumber;  #endif                     return hash1 + (hash2 * 1566083941);                 }             }          } 

This is possibly more than you bargained for, I'll annotate the code a bit:

  • The #if conditional compilation directives adapt this code to different .NET targets. The FEATURE_XX identifiers are defined elsewhere and turn features off whole sale throughout the .NET source code. WIN32 is defined when the target is the 32-bit version of the framework, the 64-bit version of mscorlib.dll is built separately and stored in a different subdirectory of the GAC.
  • The s_UseRandomizedStringHashing variable enables a secure version of the hashing algorithm, designed to keep programmers out of trouble that do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption. It is enabled by an entry in the app.exe.config file
  • The fixed statement keeps indexing the string cheap, avoids the bounds checking done by the regular indexer
  • The first Assert ensures that the string is zero-terminated as it should be, required to allow the optimization in the loop
  • The second Assert ensures that the string is aligned to an address that's a multiple of 4 as it should be, required to keep the loop performant
  • The loop is unrolled by hand, consuming 4 characters per loop for the 32-bit version. The cast to int* is a trick to store 2 characters (2 x 16 bits) in a int (32-bits). The extra statements after the loop deal with a string whose length is not a multiple of 4. Note that the zero terminator may or may not be included in the hash, it won't be if the length is even. It looks at all the characters in the string, answering your question
  • The 64-bit version of the loop is done differently, hand-unrolled by 2. Note that it terminates early on an embedded zero, so doesn't look at all the characters. Otherwise very uncommon. That's pretty odd, I can only guess that this has something to do with strings potentially being very large. But can't think of a practical example
  • The debug code at the end ensures that no code in the framework ever takes a dependency on the hash code being reproducible between runs.
  • The hash algorithm is pretty standard. The value 1566083941 is a magic number, a prime that is common in a Mersenne twister.
like image 57
Hans Passant Avatar answered Sep 20 '22 15:09

Hans Passant