Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does a HashMap rehash the hashcode supplied by the key object?

I am reading the code of the HashMap class provided by the Java 1.6 API and unable to fully understand the need of the following operation (found in the body of put and get methods):

int hash = hash(key.hashCode());

where the method hash() has the following body:

 private static int hash(int h) {
         h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

This effectively recalculates the hash by executing bit operations on the supplied hashcode. I'm unable to understand the need to do so even though the API states it as follows:

This is critical because HashMap uses power-of-two length hash tables, that otherwise encounter collisions for hashCodes that do not differ in lower bits.

I do understand that the key value pars are stored in an array of data structures, and that the index location of an item in this array is determined by its hash. What I fail to understand is how would this function add any value to the hash distribution.

like image 481
VGDIV Avatar asked Mar 29 '10 13:03

VGDIV


People also ask

What is rehashing in HashMap?

It can be also defined as rehashing is the process of re-calculating the hash code of already stored entries and moving them to a bigger size hash map when the number of elements in the map reaches the maximum threshold value. In simple words, rehashing is the reverse of hashing process. It retains the performance.

How do you rehash a HashMap?

Steps for Rehashing as follows: If the load factor is greater than its threshold value (default 0.75 for HashMap), then start Rehash. For Rehashing, initialize a new array of double the size of the previous one. Copy all elements into a new array and make it the new bucket array.

What happens if two keys have the same hashCode in HashMap?

If two keys are the same ( equals() returns true when you compare them), their hashCode() method must return the same number. If keys violate this, then keys that are equal might be stored in different buckets, and the hashmap would not be able to find key-value pairs (because it's going to look in the same bucket).

Why is rehashing needed?

Why rehashing? Rehashing is done because whenever key value pairs are inserted into the map, the load factor increases, which implies that the time complexity also increases as explained above. This might not give the required time complexity of O(1).


2 Answers

As Helper wrote, it is there just in case the existing hash function for the key objects is faulty and does not do a good-enough job of mixing the lower bits. According to the source quoted by pgras,

 /**
  * Returns index for hash code h.
  */
 static int indexFor(int h, int length) {
     return h & (length-1);
 }

The hash is being ANDed in with a power-of-two length (therefore, length-1 is guaranteed to be a sequence of 1s). Due to this ANDing, only the lower bits of h are being used. The rest of h is ignored. Imagine that, for whatever reason, the original hash only returns numbers divisible by 2. If you used it directly, the odd-numbered positions of the hashmap would never be used, leading to a x2 increase in the number of collisions. In a truly pathological case, a bad hash function can make a hashmap behave more like a list than like an O(1) container.

Sun engineers must have run tests that show that too many hash functions are not random enough in their lower bits, and that many hashmaps are not large enough to ever use the higher bits. Under these circumstances, the bit operations in HashMap's hash(int h) can provide a net improvement over most expected use-cases (due to lower collision rates), even though extra computation is required.

like image 143
tucuxi Avatar answered Nov 15 '22 17:11

tucuxi


I somewhere read this is done to ensure a good distribution even if your hashCode implementation, well, err, sucks.

like image 27
helpermethod Avatar answered Nov 15 '22 17:11

helpermethod