Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

OpenJDK's rehashing mechanism

Found this code on http://www.docjar.com/html/api/java/util/HashMap.java.html after searching for a HashMap implementation.

  264       static int hash(int h) {
  265           // This function ensures that hashCodes that differ only by
  266           // constant multiples at each bit position have a bounded
  267           // number of collisions (approximately 8 at default load factor).
  268           h ^= (h >>> 20) ^ (h >>> 12);
  269           return h ^ (h >>> 7) ^ (h >>> 4);
  270       }

Can someone shed some light on this? The comment tells us why this code is here but I would like to understand how this improves a bad hash value and how it guarantees that the positions have bounded number of collisions. What do these magic numbers mean?

like image 744
c_maker Avatar asked Oct 27 '11 20:10

c_maker


People also ask

How does rehashing work in a HashMap?

Rehashing of a hash map is done when the number of elements in the map reaches the maximum threshold value. When rehashing occurs a new hash function or even the same hash function could be used but the buckets at which the values are present could change.

Can we change load factor of HashMap?

Constructs an empty HashMap with the specified initial capacity and the default load factor (0.75). Constructs an empty HashMap with the specified initial capacity and load factor. As @Xoce mentioned, you can't change loadFactor later, I do agree with him on this. Use it while creating the hashmap.

What is HashMap and how it works?

HashMap stores entries into multiple singly linked lists, called buckets or bins. Default number of bins is 16 and it's always power of 2. HashMap uses hashCode() and equals() methods on keys for get and put operations. So HashMap key object should provide good implementation of these methods.

How HashMap works internally in Java 8 with example?

In Java 8, HashMap replaces linked list with a binary tree when the number of elements in a bucket reaches certain threshold. While converting the list to binary tree, hashcode is used as a branching variable.


1 Answers

In order for it to make any sense it has to be combined with an understanding of how HashMap allocates things in to buckets. This is the trivial function by which a bucket index is chosen:

static int indexFor(int h, int length) {
    return h & (length-1);
}

So you can see, that with a default table size of 16, only the 4 least significant bits of the hash actually matter for allocating buckets! (16 - 1 = 15, which masks the hash by 1111b)

This could clearly be bad news if your hashCode function returned:

10101100110101010101111010111111

01111100010111011001111010111111

11000000010100000001111010111111
//etc etc etc

Such a hash function would not likely be "bad" in any way that is visible to its author. But if you combine it with the way the map allocates buckets, boom, MapFail(tm).

If you keep in mind that h is a 32 bit number, those are not magic numbers at all. It is systematically xoring the most significant bits of the number rightward into the least significant bits. The purpose is so that "differences" in the number that occur anywhere "across" it when viewed in binary become visible down in the least significant bits.

Collisions become bounded because the number of different numbers that have the same relevant LSBs is now significantly bounded because any differences that occur anywhere in the binary representation are compressed into the bits that matter for bucket-ing.

like image 97
Affe Avatar answered Oct 27 '22 12:10

Affe