 

Why might a System.String object not cache its hash code?

A glance at the source code for string.GetHashCode using Reflector reveals the following (for mscorlib.dll version 4.0):

public override unsafe int GetHashCode()
{
    fixed (char* str = ((char*) this))
    {
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        for (int i = this.Length; i > 0; i -= 4)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 2)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}

Now, I realize that the implementation of GetHashCode is not specified and is implementation-dependent, so the question "is GetHashCode implemented in the form of X or Y?" is not really answerable. I'm just curious about a few things:

  1. If Reflector has disassembled the DLL correctly and this is the implementation of GetHashCode (in my environment), am I correct in interpreting this code to indicate that a string object, based on this particular implementation, would not cache its hash code?
  2. Assuming the answer is yes, why would this be? It seems to me that the memory cost would be minimal (one more 32-bit integer, a drop in the pond compared to the size of the string itself) whereas the savings would be significant, especially in cases where, e.g., strings are used as keys in a hashtable-based collection like a Dictionary<string, [...]>. And since the string class is immutable, it isn't like the value returned by GetHashCode will ever even change.
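
To illustrate what I mean in point 2, here's the kind of usage I have in mind (just a sketch of my own; the comments describe what I believe happens with the default comparer):

using System;
using System.Collections.Generic;

class StringKeyLookups
{
    static void Main()
    {
        var counts = new Dictionary<string, int>();
        string key = new string('x', 10000);   // a deliberately long key

        counts[key] = 0;

        // With the default comparer, every one of these lookups walks all
        // 10,000 characters to recompute the key's hash, because (as far as
        // I can tell) the hash isn't cached anywhere on the string instance.
        for (int i = 0; i < 1000; i++)
        {
            counts[key]++;
        }
    }
}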

What could I be missing?


UPDATE: In response to Andras Zoltan's closing remark:

There's also the point made in Tim's answer (+1 there). If he's right, and I think he is, then there's no guarantee that a string is actually immutable after construction; therefore, to cache the result would be wrong.

Whoa, whoa there! This is an interesting point to make (and yes it's very true), but I really doubt that this was taken into consideration in the implementation of GetHashCode. The statement "therefore to cache the result would be wrong" implies to me that the framework's attitude regarding strings is "Well, they're supposed to be immutable, but really if developers want to get sneaky they're mutable so we'll treat them as such." This is definitely not how the framework views strings. It fully relies on their immutability in so many ways (interning of string literals, assignment of all zero-length strings to string.Empty, etc.) that, basically, if you mutate a string, you're writing code whose behavior is entirely undefined and unpredictable.
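
To make concrete just how much the framework leans on that immutability, consider interning (a small illustration of my own; the expected outputs are in the comments):

using System;

class InterningDemo
{
    static void Main()
    {
        string a = "hello";
        string b = "hello";
        string c = new string(new[] { 'h', 'e', 'l', 'l', 'o' });

        // Literals are interned, so a and b refer to the very same object;
        // mutate one "in place" and you'd be mutating the other too.
        Console.WriteLine(object.ReferenceEquals(a, b));                 // True
        // A string built at runtime isn't automatically interned...
        Console.WriteLine(object.ReferenceEquals(a, c));                 // False
        // ...unless you ask for it.
        Console.WriteLine(object.ReferenceEquals(a, string.Intern(c)));  // True
    }
}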

I guess my point is that for the author(s) of this implementation to worry, "What if this string instance is modified between calls, even though the class as it is publicly exposed is immutable?" would be like for someone planning a casual outdoor BBQ to think to him-/herself, "What if someone brings an atomic bomb to the party?" Look, if someone brings an atom bomb, party's over.

Asked Jun 16 '10 by Dan Tao

2 Answers

Obvious potential answer: because that will cost memory.

There's a cost/benefit analysis here:

Cost: 4 bytes for every string (and a quick test on each call to GetHashCode). It would also make the string object internally mutable, which obviously means you'd need to be careful about the implementation - unless you always compute the hash code up-front, which is the cost of computing it once for every string, regardless of whether you ever hash it at all.

Benefit: Avoid recomputing the hash for string values hashed more than once
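
To put that trade-off in code, here's a rough sketch (my own, not the actual BCL implementation) of what a lazily cached hash would look like - the extra field is the per-instance memory cost, and the sentinel check is the "quick test on each call":

public sealed class CachedHashString
{
    private readonly string _value;
    private int _cachedHash;                   // 0 means "not computed yet"

    public CachedHashString(string value)
    {
        _value = value;
    }

    public override int GetHashCode()
    {
        if (_cachedHash == 0)                  // the quick test on every call
        {
            int h = _value.GetHashCode();      // the expensive part, done at most once
            _cachedHash = (h == 0) ? 1 : h;    // keep 0 free as the "not yet" sentinel
        }
        return _cachedHash;
    }

    public override bool Equals(object obj)
    {
        var other = obj as CachedHashString;
        return other != null && _value == other._value;
    }

    public override string ToString()
    {
        return _value;
    }
}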

I would suggest that in many cases, there are many, many string objects and very few of them are hashed more than once - leading to a net cost. For some cases, obviously that won't be the case.

I don't think I'm in a good position to judge which comes up more often... I would hope that MS has instrumented various real apps. (I'd also hope that Sun did the same for Java, which does cache the hash...)

EDIT: I've just spoken to Eric Lippert about this (NDC is awesome :) and basically it is about the extra memory hit vs the limited benefits.

Answered by Jon Skeet


Firstly - there's no knowing whether caching this result would actually improve Dictionary<string, ...> et al, because they don't necessarily use String.GetHashCode; the dictionary asks an IEqualityComparer<string> for the hash code of a string.
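
For example (a little sketch of my own to illustrate): if you construct the dictionary with an explicit comparer, lookups go through that comparer's GetHashCode(string), not through string.GetHashCode directly:

using System;
using System.Collections.Generic;

class ComparerDemo
{
    static void Main()
    {
        // The supplied comparer is what the dictionary asks for hash codes.
        var d = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        d["Hello"] = 1;

        // Found: the comparer's GetHashCode treats "HELLO" and "Hello" as
        // equivalent, and its Equals agrees.
        Console.WriteLine(d.ContainsKey("HELLO"));   // True
    }
}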

And if you follow the likely call chain for the StringComparer class, it ends up going through to the System.Globalization.CompareInfo class, which finally terminates at this method:

[SecurityCritical, SuppressUnmanagedCodeSecurity,
 DllImport("QCall", CharSet = CharSet.Unicode)]
private static extern int InternalGetGlobalizedHashCode(
    IntPtr handle, string localeName, string source, int length, int dwFlags);

There's no knowing whether that method - which appears to be a native call - doesn't use some form of internal caching based on the underlying .Net object data structure, something we can't get at from inside the .Net runtime.

However, the important thing to note here is that one string can have many different hash codes depending on how you choose to interpret the characters. Granted, this implementation is culture-insensitive - which is why it's unsuitable for these comparers.
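
You can see that divergence directly (a quick sketch; the exact numbers will vary by machine and runtime version - the point is only that they need not agree):

using System;

class HashInterpretationDemo
{
    static void Main()
    {
        string s = "Hello";

        // The raw ordinal hash from string.GetHashCode itself.
        Console.WriteLine(s.GetHashCode());

        // Hashes from comparers that interpret the characters differently.
        Console.WriteLine(StringComparer.OrdinalIgnoreCase.GetHashCode(s));
        Console.WriteLine(StringComparer.InvariantCulture.GetHashCode(s));
        Console.WriteLine(StringComparer.CurrentCultureIgnoreCase.GetHashCode(s));
    }
}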

So, whilst the additional memory storage could be a factor, I actually think it's because storing a hash code alongside an instance of the string would mislead the caller - and indeed the .Net internal dev team(!) - into thinking that the string has only one hash code, when in fact it depends entirely on how you're going to interpret it: as a series of bytes (which most of us do not), or as a series of printable characters.

From a performance point of view, then, if we also accept that the comparers used by Dictionary<,> etc. can't be using this internal implementation, not caching the result probably doesn't have much of an impact - because, frankly, how often will this method actually get called in the real world? Most of the time a string's hash code is most likely calculated via some other mechanism.

EDIT

There's also the point made in Tim's answer (+1 there). If he's right, and I think he is, then there's no guarantee that a string is actually immutable after construction; therefore, to cache the result would be wrong.

AN ADDITIONAL EDIT(!)

Dan makes the point that strings are meant to be immutable within the .Net sphere, and that string should therefore be free to cache its own hash code on that basis. The problem here is that the .Net framework also provides a legitimate way to change the supposedly immutable string that doesn't involve privileged reflection or anything else. It's a fundamental problem with strings: what you have is a pointer to a buffer that you cannot control. Never mind the C# world - what about C++, where vectoring over and modifying memory buffers is commonplace? Just because you ideally shouldn't do it doesn't mean that the framework should expect you not to.
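
A sketch of the kind of thing I mean (it has to be compiled with /unsafe, but there's no reflection or privilege elevation involved; the behaviour is, of course, unsupported and may vary between runtime versions):

using System;

class MutateAString
{
    static unsafe void Main()
    {
        string s = "immutable?";

        // fixed hands us a raw pointer into the string's character buffer.
        fixed (char* p = s)
        {
            p[0] = 'I';   // writes straight into the "immutable" string
        }

        // On the runtimes I've tried this prints "Immutable?" - and since the
        // literal is interned, every other "immutable?" in the process has
        // changed along with it.
        Console.WriteLine(s);
    }
}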

.Net happens to provide this functionality, and therefore if this was a design decision by the .Net team in response to the kind of binary thuggery suggested by Tim, then they were very wise to have taken it into account. Whether they did, or whether it is by fluke, is another matter entirely! :)

Answered by Andras Zoltan