
How is Object.GetHashCode() implemented in CLR & JVM?

I've been musing about this for some time: how exactly is Object.GetHashCode implemented in the CLR or Java? The contract for this method is that if it is called on the same object instance, it should always return the same value.

Note that I'm talking about the default implementation of GetHashCode(). Derived classes are not required to override this method. If they choose not to, they in essence get reference semantics: equality means "pointer equality" by default when the objects are used in hash tables and the like. This means that the runtime somehow has to provide a constant hash code for the object throughout its lifetime.

If the machine I'm running on is 32-bit, and if the object instance never moved in memory, one could theoretically return the address of the object, reinterpreted as Int32. That would be nice, since all distinct objects have distinct addresses and therefore would have different hash codes.

However, this approach is flawed, amongst other things because:

  • If the garbage collector moves the object in memory, its address changes, and so would its hash code, in violation of the contract that the hash code should stay the same for the lifetime of the object.

  • On a 64-bit system, the object's address is too wide to fit into Int32.

  • Because managed objects tend to be aligned to a power-of-two boundary, the lowest bits of the address will always be zero. This may cause bad distribution patterns when the hash codes are used to index into a hash table.
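To make the alignment point concrete, here is a small C++ sketch. It uses ordinary heap allocations as a stand-in for managed objects, and `Node`, `naive_address_hash`, `buckets_used`, and `hash_some_objects` are names made up for this illustration. It only shows the effect of alignment on the low bits; it is not how any runtime computes hash codes.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a managed object; real runtimes use their own layouts.
struct Node {
    void*         type_handle = nullptr;
    std::uint64_t payload     = 0;
};

// The naive "hash" from the question: the address truncated to 32 bits.
std::uint32_t naive_address_hash(const void* p) {
    return static_cast<std::uint32_t>(reinterpret_cast<std::uintptr_t>(p));
}

// Counts how many distinct buckets the hashes land in when a table with
// `bucket_count` buckets (a power of two) indexes by the low bits.
std::size_t buckets_used(const std::vector<std::uint32_t>& hashes,
                         std::size_t bucket_count) {
    std::vector<bool> seen(bucket_count, false);
    std::size_t used = 0;
    for (std::uint32_t h : hashes) {
        std::size_t b = h & (bucket_count - 1);
        if (!seen[b]) { seen[b] = true; ++used; }
    }
    return used;
}

// Allocates n objects and records the naive hash of each one.
std::vector<std::uint32_t> hash_some_objects(std::size_t n) {
    std::vector<std::unique_ptr<Node>> keep_alive;  // keeps the addresses distinct
    std::vector<std::uint32_t> hashes;
    for (std::size_t i = 0; i < n; ++i) {
        keep_alive.push_back(std::make_unique<Node>());
        hashes.push_back(naive_address_hash(keep_alive.back().get()));
    }
    return hashes;
}
```

With a table of `alignof(Node)` buckets indexed by the low bits, every object ends up in bucket 0, because those bits are zero for every properly aligned address.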

In .NET, a System.Object consists of a sync block and a type handle and nothing more, so the hashcode cannot be cached in the instance itself. Somehow the runtime is able to provide a persistent hashcode. How? And how do Java, Mono, and other runtimes do this?

John Källén asked Apr 07 '11 13:04


2 Answers

No, not the address; that can't work with a garbage collector that moves objects. It is intuitively simple: the hash code can be a random number, as long as it is stored after it is generated. It does get stored in the object, in the syncblk field. That field can hold more than one object property; if more than one such property needs to be stored, it is replaced by an index to an allocated syncblk.
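A rough C++ sketch of that idea: one header field that holds the hash code inline, and switches to an index into a side table once a second property (here, a lock owner) has to be stored. The bit layout and every name (`kHasHash`, `kIsIndex`, `kPayloadMask`, `SideEntry`, `Runtime`) are invented for illustration and are not the CLR's actual layout.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical bit layout for illustration only; NOT the CLR's real one.
constexpr std::uint32_t kHasHash     = 1u << 31;        // payload is an inline hash code
constexpr std::uint32_t kIsIndex     = 1u << 30;        // payload is a side-table index
constexpr std::uint32_t kPayloadMask = (1u << 26) - 1;  // low 26 bits carry the payload

// Side-table entry, used once an object needs more than one property.
struct SideEntry {
    bool          has_hash   = false;
    std::uint32_t hash       = 0;
    std::uint32_t lock_owner = 0;
};

struct Runtime {
    std::vector<SideEntry> side_table;
    std::mt19937 rng{12345};  // any generator works; the value only has to be remembered

    std::uint32_t new_hash() {
        return static_cast<std::uint32_t>(rng()) & kPayloadMask;
    }

    // Returns the object's hash code, generating and storing it on first use.
    std::uint32_t get_hash(std::uint32_t& header) {
        if (header & kIsIndex) {
            SideEntry& e = side_table[header & kPayloadMask];
            if (!e.has_hash) { e.hash = new_hash(); e.has_hash = true; }
            return e.hash;
        }
        if (!(header & kHasHash)) header = kHasHash | new_hash();
        return header & kPayloadMask;
    }

    // Taking a lock is a second property, so the header switches to an index
    // and any hash code already stored inline is carried over to the entry.
    void lock(std::uint32_t& header, std::uint32_t thread_id) {
        if (!(header & kIsIndex)) {
            SideEntry e;
            if (header & kHasHash) { e.has_hash = true; e.hash = header & kPayloadMask; }
            side_table.push_back(e);
            header = kIsIndex | static_cast<std::uint32_t>(side_table.size() - 1);
        }
        side_table[header & kPayloadMask].lock_owner = thread_id;
    }
};
```

The point of the sketch is only that the hash code survives the switch from inline storage to the side table, so callers keep seeing the same value.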

The .NET algorithm uses the managed thread ID so that threads are not likely to generate the same sequence:

inline DWORD GetNewHashCode()
{
    // Every thread has its own generator for hash codes so that we won't get into a situation
    // where two threads consistently give out the same hash codes.        
    // Choice of multiplier guarantees period of 2**32 - see Knuth Vol 2 p16 (3.2.1.2 Theorem A)
    DWORD multiplier = m_ThreadId*4 + 5;
    m_dwHashCodeSeed = m_dwHashCodeSeed*multiplier + 1;
    return m_dwHashCodeSeed;
}

The seed is stored per-thread, so no lock is required. At least that's what is used in the SSCLI20 version. I don't know about Java; I imagine it is similar.
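For anyone who wants to play with the recurrence outside the runtime, here is a standalone C++ sketch. `ThreadHashState` and `period_mod_2_16` are hypothetical names for this illustration; in the SSCLI the two fields live on the runtime's thread object. The second function runs the same recurrence modulo 2^16 so that the full-period claim from the comment can be checked quickly.

```cpp
#include <cstdint>

// Hypothetical stand-in for the per-thread state used by the snippet above.
struct ThreadHashState {
    std::uint32_t thread_id;
    std::uint32_t seed;

    std::uint32_t next_hash() {
        // multiplier - 1 is a multiple of 4 and the increment (1) is odd,
        // which are the conditions for a full period when the modulus is a
        // power of two.
        std::uint32_t multiplier = thread_id * 4 + 5;
        seed = seed * multiplier + 1;  // unsigned arithmetic wraps modulo 2^32
        return seed;
    }
};

// Same recurrence reduced to 16 bits: counts the steps needed to get back to
// the starting value 0. A full period means exactly 65536 steps.
std::uint32_t period_mod_2_16(std::uint32_t thread_id) {
    const std::uint32_t mask = 0xFFFF;
    const std::uint32_t multiplier = (thread_id * 4 + 5) & mask;
    std::uint32_t x = 0;
    std::uint32_t steps = 0;
    do {
        x = (x * multiplier + 1) & mask;
        ++steps;
    } while (x != 0);
    return steps;
}
```

Two threads starting from the same seed diverge because their multipliers differ: with a seed of 1, thread 1 produces 10 first and thread 2 produces 14.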

Hans Passant answered Oct 24 '22 11:10


As a JVM implementer, I can say that the base hashcode IS typically related to the address of the object. It's typically not exactly the address, but some reasonable mangling of it. We do magic to ensure the hashCode is stable through the life of the object (even across GC, even if the object moves, etc.).
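One way such "magic" could work, sketched in C++ under loud assumptions: derive the value from the address the first time it is requested, cache it with the object, and let the cached value travel with the object when it is relocated. The names (`Obj`, `mangle_address`, `identity_hash`, `relocate`) and the particular mangling are invented for this illustration and do not describe any specific JVM.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical object with room for a cached hash; 0 means "not computed yet".
struct Obj {
    std::uint32_t cached_hash = 0;
    int           payload     = 0;
};

// Hypothetical mangling: drop the low alignment bits, then fold the high half
// of the address into the low half.
std::uint32_t mangle_address(const void* p) {
    std::uint64_t a = reinterpret_cast<std::uintptr_t>(p);
    a >>= 3;
    std::uint32_t h = static_cast<std::uint32_t>(a ^ (a >> 32));
    return h == 0 ? 1 : h;  // keep 0 reserved as the "not computed" marker
}

// Computes the hash from the current address on first use, then reuses it.
std::uint32_t identity_hash(Obj& o) {
    if (o.cached_hash == 0) o.cached_hash = mangle_address(&o);
    return o.cached_hash;
}

// Simulates a moving collector: the object is copied to a new address and the
// cached hash is copied along with the rest of its state.
std::unique_ptr<Obj> relocate(const Obj& from) {
    return std::make_unique<Obj>(from);
}
```

After `relocate`, the copy sits at a different address, but `identity_hash` still returns the value computed at the old one, which is all the contract requires.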

I strongly recommend implementing a good type-specific hashCode() for all objects you're going to be hashing. That Object implements it doesn't mean it's ideal for your use.

Trent Gray-Donald answered Oct 24 '22 13:10