How is GetHashCode() of C# string implemented?

Tags:

I'm just curious because I guess it will have impact on performance. Does it consider the full string? If yes, it will be slow on long string. If it only consider part of the string, it will have bad performance (e.g. if it only consider the beginning of the string, it will have bad performance if a HashSet contains mostly strings with the same.

752

asked Mar 02 '13 12:03

Louis Rhys

1 Answers

Be sure to obtain the Reference Source source code when you have questions like this. There's a lot more to it than what you can see from a decompiler. Pick the one that matches your preferred .NET target, the method has changed a great deal between versions. I'll just reproduce the .NET 4.5 version of it here, retrieved from Source.NET 4.5\4.6.0.0\net\clr\src\BCL\System\String.cs\604718\String.cs

        public override int GetHashCode() {   #if FEATURE_RANDOMIZED_STRING_HASHING             if(HashHelpers.s_UseRandomizedStringHashing)             {                  return InternalMarvin32HashString(this, this.Length, 0);             }  #endif // FEATURE_RANDOMIZED_STRING_HASHING               unsafe {                  fixed (char *src = this) {                     Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");                     Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");  #if WIN32                     int hash1 = (5381<<16) + 5381;  #else                      int hash1 = 5381; #endif                      int hash2 = hash1;  #if WIN32                     // 32 bit machines.                      int* pint = (int *)src;                     int len = this.Length;                      while (len > 2)                      {                         hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];                          hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];                         pint += 2;                         len  -= 4;                     }                       if (len > 0)                      {                          hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];                     }  #else                     int     c;                     char *s = src;                     while ((c = s[0]) != 0) {                          hash1 = ((hash1 << 5) + hash1) ^ c;                         c = s[1];                          if (c == 0)                              break;                         hash2 = ((hash2 << 5) + hash2) ^ c;                          s += 2;                     } #endif #if DEBUG                      // We want to ensure we can change our hash function daily.                     // This is perfectly fine as long as you don't persist the                      // value from GetHashCode to disk or count on String A                      // hashing before string B.  Those are bugs in your code.                     hash1 ^= ThisAssembly.DailyBuildNumber;  #endif                     return hash1 + (hash2 * 1566083941);                 }             }          }

This is possibly more than you bargained for, I'll annotate the code a bit:

The #if conditional compilation directives adapt this code to different .NET targets. The FEATURE_XX identifiers are defined elsewhere and turn features off whole sale throughout the .NET source code. WIN32 is defined when the target is the 32-bit version of the framework, the 64-bit version of mscorlib.dll is built separately and stored in a different subdirectory of the GAC.
The s_UseRandomizedStringHashing variable enables a secure version of the hashing algorithm, designed to keep programmers out of trouble that do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption. It is enabled by an entry in the app.exe.config file
The fixed statement keeps indexing the string cheap, avoids the bounds checking done by the regular indexer
The first Assert ensures that the string is zero-terminated as it should be, required to allow the optimization in the loop
The second Assert ensures that the string is aligned to an address that's a multiple of 4 as it should be, required to keep the loop performant
The loop is unrolled by hand, consuming 4 characters per loop for the 32-bit version. The cast to int* is a trick to store 2 characters (2 x 16 bits) in a int (32-bits). The extra statements after the loop deal with a string whose length is not a multiple of 4. Note that the zero terminator may or may not be included in the hash, it won't be if the length is even. It looks at all the characters in the string, answering your question
The 64-bit version of the loop is done differently, hand-unrolled by 2. Note that it terminates early on an embedded zero, so doesn't look at all the characters. Otherwise very uncommon. That's pretty odd, I can only guess that this has something to do with strings potentially being very large. But can't think of a practical example
The debug code at the end ensures that no code in the framework ever takes a dependency on the hash code being reproducible between runs.
The hash algorithm is pretty standard. The value 1566083941 is a magic number, a prime that is common in a Mersenne twister.

answered Sep 20 '22 15:09

Hans Passant

Related questions
                            
                                Retrieve an object from entityframework without ONE field
                            
                                Performance of "direct" virtual call vs. interface call in C#
                            
                                How to get the values of a ConfigurationSection of type NameValueSectionHandler
                            
                                Moq ReturnsAsync() with parameters
                            
                                Is it bad practice to write inline event handlers
                            
                                C#: Waiting for all threads to complete
                            
                                Is there any graph data structure implemented for C#
                            
                                Get all files and directories in specific path fast
                            
                                What is the difference between await Task<T> and Task<T>.Result?
                            
                                How do I open SSRS (.rptproj) files in Visual Studio 2013?
                            
                                How to get csc.exe path?
                            
                                Should "Dispose" only be used for types containing unmanaged resources?
                            
                                Add client certificate to .NET Core HttpClient
                            
                                Where can I learn about MEF? [closed]
                            
                                Optional return in C#.Net
                            
                                How to extract custom header value?
                            
                                How to correctly unregister an event handler
                            
                                MVVM: Binding radio buttons to a view model?
                            
                                C# Data Structure Like Dictionary But Without A Value
                            
                                Overloading assignment operator in C#

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How is GetHashCode() of C# string implemented?

Tags:

string

c#

.net

gethashcode

hash

Louis Rhys

People also ask

1 Answers

Hans Passant

Recent Activity

Donate For Us