Why hash method in HashMap [duplicate]

Tags: java, hashmap

The Javadoc of the hash method states:

Retrieves the object's hash code and applies a supplemental hash function to the result, which defends against poor-quality hash functions. This is critical because HashMap uses power-of-two length hash tables that otherwise encounter collisions for hashCodes that do not differ in the lower bits.

What I am not able to understand is:

1) Why does HashMap use power-of-two length hash tables?

This constraint is also stated where the table is declared:

/**
 * The table, resized as necessary. Length MUST Always be a power of two.
 */
transient Entry<K,V>[] table;

Why this constraint?

2) What does "otherwise encounter collisions for hashCodes that do not differ in lower bits" mean?

asked Oct 21 '22 by codingenious

1 Answer

The purpose of a hashmap is to very quickly narrow down how many objects you need to look at (ideally 0 or 1) when searching for a specific key.

The general method for HashMap.get(key) is as follows (a rough code sketch follows the list):

  1. Call key.hashCode() to get a single integer that represents the object.

  2. Look in the hash "bucket" selected by that hashcode; a bucket can contain zero or more entries.

  3. Go through each entry in the bucket and check whether any entry's key is .equals(key). If so, return that entry's value. If no entry in the bucket has a key equal to the one searched for, return null.
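Here is a rough sketch of that lookup logic (not the actual JDK source; the Entry class, table field, and hash() helper are simplified stand-ins for the real internals):

class SimpleHashMap<K, V> {
    static class Entry<K, V> {
        final int hash;
        final K key;
        V value;
        Entry<K, V> next;
        Entry(int hash, K key, V value, Entry<K, V> next) {
            this.hash = hash; this.key = key; this.value = value; this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    Entry<K, V>[] table = (Entry<K, V>[]) new Entry[16]; // length is a power of two

    static int hash(int h) {
        return h ^ (h >>> 16); // supplemental hash, discussed later in this answer
    }

    V get(Object key) {
        int h = hash(key.hashCode());          // step 1: get (and smear) the hash code
        int index = h & (table.length - 1);    // step 2: pick a bucket
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            if (e.hash == h && key.equals(e.key)) { // step 3: compare keys with equals()
                return e.value;
            }
        }
        return null;                           // no entry with an equal key
    }
}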

The difference between a good hashmap and a bad hashmap is speed. You have to balance all three of these concerns:

  1. How quickly can you transform the key into a hashcode?

  2. How often do two different keys map to the same hashcode?

  3. How often will you put two keys with different hashcodes into the same "bucket"?

Java's designers have chosen a set of tradeoffs that they think balances these concerns best. There is no single right answer, but you do have to choose a specific approach and document what that approach is.

Java's designers likely had statistical evidence, based on the data typically added to hashmaps, to guide that choice.

They chose to convert a hashcode to a bucket index by extracting the lowest n bits of the hashcode, because those bits vary more often than the upper bits. They chose bit extraction over the other common method of converting a hashcode to a bucket (taking the integer remainder after dividing by a prime table size) because a bit mask is typically a faster operation on the platforms Java is most commonly deployed to.
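For example (an illustrative throwaway snippet, not JDK code), mapping a hash code to a bucket index with a power-of-two table needs only a bit mask, whereas a prime-sized table needs a division:

public class BucketIndexDemo {
    public static void main(String[] args) {
        int hash = "some key".hashCode();

        // Power-of-two table: keep only the lowest n bits with a cheap bit mask.
        int indexPowerOfTwo = hash & (16 - 1);    // table length 16 -> lowest 4 bits

        // Prime-sized table: remainder after division, typically a slower operation.
        int indexPrime = Math.floorMod(hash, 17); // table length 17 (a prime)

        System.out.println(indexPowerOfTwo + " " + indexPrime);
    }
}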

What Java's designers may have found is that step 1, the implementation of hashCode(), is written by Java users, and can often be terrible, returning the same hashCode for lots of objects they want to store in the same hashmap. Imagine if the hashCode was this:

public class MyInteger {
    final int i;
    public MyInteger(int i) {
        this.i = i;
    }
    public int hashCode() {
        return i << 24; // will return 0x00000000, 0x01000000, etc.
    }
    public boolean equals(Object o) {
        return (o != null) && (o instanceof MyInteger) && ((MyInteger)o).i == i;
    }
}

This is what they call "poor quality"; the lower bits of the hashcode don't vary very much. In this pathological implementation, the lower 24 bits don't vary at all!

In this case, ignoring the supplemental hash, every single key that could go into the hashmap would land in bucket 0 for any table smaller than 16,777,216 (2^24) buckets, and all the other buckets would stay empty.
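To make the arithmetic concrete (a throwaway check, assuming the default 16-bucket table):

public class AllInBucketZero {
    public static void main(String[] args) {
        // The bucket index is hashCode() & (tableLength - 1). With MyInteger's
        // hashCode of i << 24 and 16 buckets, the mask keeps only the lowest
        // 4 bits, which are always zero.
        for (int i = 0; i < 1000; i++) {
            int bucket = (i << 24) & (16 - 1);
            if (bucket != 0) {
                throw new AssertionError("unexpected bucket " + bucket);
            }
        }
        System.out.println("all 1000 keys map to bucket 0");
    }
}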

Real-world hashcodes may not be as bad as this one, but they are often bad enough that Java's designers chose to apply a second, supplemental hash function. It improves the chance that two different keys go into two different buckets, which reduces how many objects need to be checked for equality each time a given key is retrieved.
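The supplemental hash XORs higher bits of the original hash code down into the lower bits, so the bucket mask sees some of that variation. Below is a sketch in the spirit of the Java 7 implementation (the exact shifts vary between JDK versions; Java 8, for example, uses a single h ^ (h >>> 16)):

public class SupplementalHashDemo {
    // Spreads information from the high bits of h into the low bits, so a poor
    // hashCode() like MyInteger's still varies after masking with (length - 1).
    static int supplementalHash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            int raw = i << 24;                                // MyInteger.hashCode() for i
            int bucket = supplementalHash(raw) & (16 - 1);
            System.out.println("i=" + i + " -> bucket " + bucket); // prints buckets 0..4
        }
    }
}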

answered Oct 23 '22 by Stuart Caie