HashMap load factor - based on number of buckets occupied or number of entries in all buckets?

I was trying to understand whether the rehashing of a HashMap takes place when the number of occupied buckets exceeds the threshold, or when the total number of entries across all buckets does. We know that if 12 out of 16 buckets are full (one entry per bucket, with the default load factor of 0.75 and initial capacity of 16), then on the next insertion the HashMap will be rehashed. But what about the case where only 3 buckets are occupied, with 4 entries each (12 entries in total, but only 3 out of 16 buckets in use)?

So I tried to replicate this by writing the worst possible hash function, one that puts all entries into a single bucket.

Here is my code.

class X {

    public Integer value;

    public X(Integer value) {
        this.value = value;
    }

    // Deliberately terrible hash function: every instance gets the
    // same hash code, so all entries land in the same bucket.
    @Override
    public int hashCode() {
        return 1;
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof X)) {
            return false;
        }
        return this.value.equals(((X) obj).value);
    }

}

Now I started putting values into the HashMap.

HashMap<X, Integer> map = new HashMap<>();
map.put(new X(1), 1);
map.put(new X(2), 2);
map.put(new X(3), 3);
map.put(new X(4), 4);
map.put(new X(5), 5);
map.put(new X(6), 6);
map.put(new X(7), 7);
map.put(new X(8), 8);
map.put(new X(9), 9);
map.put(new X(10), 10);
map.put(new X(11), 11);
map.put(new X(12), 12);
map.put(new X(13), 13);
System.out.println(map.size());

All the nodes were entering the single bucket as expected, but I noticed that on the 9th entry the HashMap rehashed and doubled its capacity, and on the 10th entry it doubled its capacity again.

(Screenshots: the map's internal capacity with 8 entries vs. with 9 entries.)

Can anyone explain how this is happening?

Thanks in advance.

asked Dec 30 '17 by Sanchit Saxena

3 Answers

In a HashMap, entries end up in the same bucket when their hash codes map to the same bucket index. If distinct Integer keys are placed in a HashMap, their hash codes will differ (an Integer's hash code is its own value), so they normally spread across buckets.

But in your case all the objects have the same hash code, which means, as you said, all entries will be in a single bucket. Each bucket stores its entries in a specific data structure (a linked list or a tree). The capacity changes according to that data structure as well as the map's own rules.

I have run JB Nizet's code (https://gist.github.com/jnizet/34ca08ba0314c8e857ea9a161c175f13/revisions) mentioned in the comments with loop limits of 64 and 128 (adding 64 and 128 elements):

  1. While adding 64 elements: the capacity doubled (to 128) while adding the 49th element to the HashMap (because the load factor is 0.75).
  2. While adding 128 elements: the capacity doubled (to 256) while adding the 97th element to the HashMap (also because the load factor is 0.75).

After the capacity reaches 64, the HashMap behaves normally.
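JB Nizet's gist is linked above rather than reproduced; here is a minimal sketch along the same lines that prints the table capacity after each insertion. It assumes the private field name table from the OpenJDK 8 HashMap source, and on Java 9+ reflective access needs --add-opens java.base/java.util=ALL-UNNAMED:

import java.lang.reflect.Field;
import java.util.HashMap;

public class CapacityWatcher {

    // Reads the length of HashMap's internal bucket array via reflection.
    // The field name "table" comes from the OpenJDK 8 source; the array
    // is null until the first entry is inserted.
    static int capacityOf(HashMap<?, ?> map) throws Exception {
        Field tableField = HashMap.class.getDeclaredField("table");
        tableField.setAccessible(true);
        Object[] table = (Object[]) tableField.get(map);
        return table == null ? 0 : table.length;
    }

    public static void main(String[] args) throws Exception {
        HashMap<X, Integer> map = new HashMap<>();
        for (int i = 1; i <= 13; i++) {
            map.put(new X(i), i);
            System.out.println("entries=" + i + ", capacity=" + capacityOf(map));
        }
    }
}

With the X class from the question, this should print a capacity of 16 for the first 8 entries, 32 at the 9th, and 64 from the 10th onwards.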

In summary, a bucket uses a linked list up to a certain length (8 elements). After that it uses a tree data structure (which is where the fluctuation in capacity comes from). The reason is that accessing a tree structure (O(log n)) is faster than a linked list (O(n)).

(Image: the inner array of a Java 8 HashMap, with a tree at bucket 0 and linked lists at buckets 1, 2 and 3. Bucket 0 is a tree because it contains more than 8 nodes.)

The documentation on HashMap and the discussions on buckets in a HashMap would be helpful in this regard.

answered by Nithin

Responding to the comments more than to the question itself, since the comments are more relevant to what you actually want to know.

The best and most relevant place where this bucket-size-based rehashing is explained is the source code itself. What you observe on the 9th entry is expected, and it happens in this portion of the code (from HashMap.putVal):

for (int binCount = 0; ; ++binCount) {
    // some irrelevant lines skipped
    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
        treeifyBin(tab, hash);
    break;
}

where TREEIFY_THRESHOLD = 8 and binCount counts the nodes already present in the bucket's chain (not the number of bins). With 8 nodes already in the chain, binCount reaches 7 when the end of the chain is found, so the condition fires on the 9th insertion into that bucket.

That treeifyBin method name is a bit misleading, since it might resize the table rather than treeify a bin. This is the relevant part of the code from that method:

if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
    resize();

Notice that it will actually resize (that is, double its capacity) rather than build a tree until MIN_TREEIFY_CAPACITY (64) is reached.
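For context, here is a slightly fuller excerpt of treeifyBin, abridged from the OpenJDK 8 source (the chain-to-tree conversion details are elided):

final void treeifyBin(Node<K,V>[] tab, int hash) {
    int n, index; Node<K,V> e;
    // Too small a table: grow it instead of treeifying, in the hope
    // that the resize spreads the colliding entries out.
    if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
        resize();
    // Table is large enough: convert the bucket's linked list of
    // Nodes into TreeNodes and build a red-black tree from them.
    else if ((e = tab[index = (n - 1) & hash]) != null) {
        // ... replace each Node with a TreeNode, then treeify the chain
    }
}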

answered by Eugene


Read the source code of HashMap:

/**
 * The smallest table capacity for which bins may be treeified.
 * (Otherwise the table is resized if too many nodes in a bin.)
 * Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
 * between resizing and treeification thresholds.
 */
static final int MIN_TREEIFY_CAPACITY = 64;

and you will see:

  1. If the capacity has not reached MIN_TREEIFY_CAPACITY (64) and the nodes in a single bucket reach TREEIFY_THRESHOLD, the map will resize.
  2. If the capacity has reached MIN_TREEIFY_CAPACITY (64) and the nodes in a single bucket reach TREEIFY_THRESHOLD, the map will treeify the nodes in that bucket (aka "bins" in the source code).

Resize and treeify are two operations that both reorganize the map, and choosing between them based on the scenario is a trade-off.
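Putting the two rules together, this is the progression one would expect for the 13 same-hash keys from the question (a sketch derived from the OpenJDK 8 constants, not program output):

// Default initial capacity 16; every key hashes to the same bucket.
// insert 1..8  : capacity 16, bucket chain grows to 8 nodes
// insert 9     : chain would exceed TREEIFY_THRESHOLD (8);
//                16 < MIN_TREEIFY_CAPACITY -> resize to 32
// insert 10    : chain is still too long; 32 < 64 -> resize to 64
// insert 11    : 64 >= MIN_TREEIFY_CAPACITY -> the bucket is treeified
// insert 12,13 : the tree grows; the capacity stays at 64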

answered by Kaka Diego