Found this code on http://www.docjar.com/html/api/java/util/HashMap.java.html after searching for a HashMap implementation.
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
Can someone shed some light on this? The comment tells us why this code is here, but I would like to understand how it improves a bad hash value and how it guarantees that the bucket positions have a bounded number of collisions. What do these magic numbers mean?
Rehashing of a hash map is done when the number of entries in the map exceeds the threshold (capacity multiplied by the load factor). When rehashing occurs, a new hash function (or even the same hash function) could be used, but because the table length changes, the buckets in which the values end up can change.
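A minimal sketch of why a bucket can change after a resize, assuming the bucket index is computed as h & (length - 1) (the indexFor method quoted in another answer below); the class name and values here are mine, chosen only for illustration:
public class RehashDemo {
    public static void main(String[] args) {
        int h = 0b10111;        // some hash value (23)
        int oldLength = 16;     // table size before the resize
        int newLength = 32;     // table size after the resize (doubled)
        // Same hash, but the index depends on the table length,
        // so the entry can move to a different bucket after rehashing.
        System.out.println(h & (oldLength - 1)); // 7
        System.out.println(h & (newLength - 1)); // 23
    }
}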
HashMap(int initialCapacity) constructs an empty HashMap with the specified initial capacity and the default load factor (0.75); HashMap(int initialCapacity, float loadFactor) constructs an empty HashMap with the specified initial capacity and load factor. As @Xoce mentioned, you can't change the loadFactor later, and I agree with him on this. Set it while creating the HashMap.
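A short usage example of those two constructors (the capacity and load factor values here are arbitrary, picked only for illustration):
import java.util.HashMap;
import java.util.Map;

public class ConstructorDemo {
    public static void main(String[] args) {
        // initial capacity 32, default load factor 0.75
        Map<String, Integer> a = new HashMap<>(32);
        // initial capacity 32, custom load factor 0.9 (fixed at construction)
        Map<String, Integer> b = new HashMap<>(32, 0.9f);
        b.put("answer", 42);
        System.out.println(b.get("answer")); // 42
    }
}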
HashMap stores entries in multiple singly linked lists, called buckets or bins. The default number of bins is 16, and it is always a power of two. HashMap uses the hashCode() and equals() methods of keys for get and put operations, so a HashMap key class should provide good implementations of both, as in the sketch below.
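A sketch of such a key class (the Point type is hypothetical, not from the original post): equal keys must produce equal hash codes, otherwise get cannot find what put stored.
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class KeyDemo {
    // Hypothetical key type with consistent equals()/hashCode()
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            return Objects.hash(x, y); // combines both fields
        }
    }

    public static void main(String[] args) {
        Map<Point, String> map = new HashMap<>();
        map.put(new Point(1, 2), "a");
        // Works because equals() and hashCode() agree:
        System.out.println(map.get(new Point(1, 2))); // a
    }
}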
In Java 8, HashMap replaces the linked list in a bucket with a balanced binary tree (a red-black tree) when the number of elements in that bucket reaches a certain threshold (8 by default). While converting the list to a tree, the hash code is used as a branching variable.
In order for it to make any sense, it has to be combined with an understanding of how HashMap allocates entries into buckets. This is the trivial function by which a bucket index is chosen:
static int indexFor(int h, int length) {
    return h & (length - 1);
}
So you can see that with the default table size of 16, only the 4 least significant bits of the hash actually matter for allocating buckets! (16 - 1 = 15, which masks the hash with 1111b.)
This could clearly be bad news if your hashCode function returned:
10101100110101010101111010111111
01111100010111011001111010111111
11000000010100000001111010111111
//etc etc etc
Such a hash function would likely not be "bad" in any way that is visible to its author. But if you combine it with the way the map allocates buckets, boom, MapFail(tm).
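To make that concrete, here is a small, hypothetical check (the class and array names are mine) that feeds those three hash codes straight into indexFor without any spreading:
public class CollisionDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        // the three example hash codes from above, as binary int literals
        int[] hashes = {
            0b1010_1100_1101_0101_0101_1110_1011_1111,
            0b0111_1100_0101_1101_1001_1110_1011_1111,
            0b1100_0000_0101_0000_0001_1110_1011_1111
        };
        for (int h : hashes) {
            // all three share the low bits 1111, so every one hits bucket 15
            System.out.println(indexFor(h, 16));
        }
    }
}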
If you keep in mind that h is a 32-bit number, those are not magic numbers at all. The function systematically XORs the most significant bits of the number rightward into the least significant bits. The purpose is that "differences" in the number that occur anywhere "across" it, when viewed in binary, become visible down in the least significant bits.
Collisions become bounded because the number of distinct values that share the same relevant low bits is now significantly reduced, since any differences that occur anywhere in the binary representation are compressed into the bits that matter for bucketing.
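Here is a small, hypothetical driver (again, class and array names are mine) that runs the same three hash codes through the hash() spreading function from the question before picking a bucket:
public class SpreadDemo {
    // the spreading function quoted in the question
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int[] hashes = {
            0b1010_1100_1101_0101_0101_1110_1011_1111,
            0b0111_1100_0101_1101_1001_1110_1011_1111,
            0b1100_0000_0101_0000_0001_1110_1011_1111
        };
        for (int h : hashes) {
            // without spreading every value lands in bucket 15;
            // after spreading, differences in the high bits reach the low bits
            System.out.println(indexFor(h, 16) + " -> " + indexFor(hash(h), 16));
        }
    }
}
Running this, the three values that previously all collided in bucket 15 should now end up in different buckets, because bits from the top of the word now influence the low four bits that indexFor looks at.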