Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why return (h = key.hashCode()) ^ (h >>> 16) other than key.hashcode?

Tags:

java

hashmap

I don't see this approach avoid the collision. I think if the key.hashcode is larger than table.length, there will be collision.

Updates: Actually I refer to the HashMap#hash in JDK 1.8, and was a little confused about the benifit of spread the higher bits downward. Now, I think I'm clear with the help of this link, the benifits is:

  • We don't need to do the % calculation, but using a more faster way - bit shift.

For the collision, if the number of key is larger than the length of the table, then there will be a collision no matter what hash method is used.

like image 236
caisil Avatar asked Jan 29 '23 23:01

caisil


2 Answers

Let's say you naively index into a hashtable using

int index = hashcode % table.length;

This may lead to many collisions in some common use cases. For example, assume table.length is a small power of two (like 32 or 64). In this case only the low order bits of the hashcode determine the index. This will cause lots of collisions if your object's hashcode only differ in the upper bits. The bit shift allows the upper bits of the hashcode to also influence the computed index.

like image 90
Evan Darke Avatar answered Feb 02 '23 10:02

Evan Darke


The reason for that is in the comments:

Computes key.hashCode() and spreads (XORs) higher bits of hash to lower. Because the table uses power-of-two masking, sets of hashes that vary only in bits above the current mask will always collide. (Among known examples are sets of Float keys holding consecutive whole numbers in small tables.)

What that is saying in simple words is that Key#hashcode (last bits that we care about) would be the same for Keys that are actually different. And that would create collisions since these entries will end-up in the same bucket.

Where an Entry goes is decided based on the number of existing buckets or from the last n - bits as you have already seen from :

int index = (n - 1) & hash

If hashmap would not re-hash again - it means that entries that don't differ in those last bits would end up in the same bucket, search time == slower.

The reason that XOR is used - because it has 50/50% distribution of 1 and 0 (as opposed to | or & that have 75/25 or 25/75).

And the & operation is used instead of %, not for speed only, but because hashcodes are ints and can be negative. modulo on a negative number will be negative - meaning a negative bucket... Thus & is used which will generate a positive index.

like image 20
Eugene Avatar answered Feb 02 '23 11:02

Eugene