Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unexpected running times for HashSet code

So originally, I had this code:

import java.util.*;

public class sandbox {
    public static void main(String[] args) {
        HashSet<Integer> hashSet = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            hashSet.add(i);
        }

        long start = System.currentTimeMillis();

        for (int i = 0; i < 100_000; i++) {
            for (Integer val : hashSet) {
                if (val != -1) break;
            }

            hashSet.remove(i);
        }

        System.out.println("time: " + (System.currentTimeMillis() - start));
    }
}

It takes around 4s to run the nested for loops on my computer and I don't understand why it took so long. The outer loop runs 100,000 times, the inner for loop should run 1 time (because any value of hashSet will never be -1) and the removing of an item from a HashSet is O(1), so there should be around 200,000 operations. If there are typically 100,000,000 operations in a second, how come my code takes 4s to run?

Additionally, if the line hashSet.remove(i); is commented out, the code only takes 16ms. If the inner for loop is commented out (but not hashSet.remove(i);), the code only takes 8ms.

like image 915
scdivad Avatar asked Dec 29 '19 18:12

scdivad


Video Answer


1 Answers

You've created a marginal use case of HashSet, where the algorithm degrades to the quadratic complexity.

Here is the simplified loop that takes so long:

for (int i = 0; i < 100_000; i++) {
    hashSet.iterator().next();
    hashSet.remove(i);
}

async-profiler shows that almost all time is spent inside java.util.HashMap$HashIterator() constructor:

    HashIterator() {
        expectedModCount = modCount;
        Node<K,V>[] t = table;
        current = next = null;
        index = 0;
        if (t != null && size > 0) { // advance to first entry
--->        do {} while (index < t.length && (next = t[index++]) == null);
        }
    }

The highlighted line is a linear loop that searches for the first non-empty bucket in the hash table.

Since Integer has the trivial hashCode (i.e. hashCode is equal to the number itself), it turns out that consecutive integers mostly occupy the consecutive buckets in the hash table: number 0 goes to the first bucket, number 1 goes to the second bucket, etc.

Now you remove the consecutive numbers from 0 to 99999. In the simplest case (when the bucket contains a single key) the removal of a key is implemented as nulling out the corresponding element in the bucket array. Note that the table is not compacted or rehashed after removal.

So, the more keys you remove from the beginning of the bucket array, the longer HashIterator needs to find the first non-empty bucket.

Try to remove the keys from the other end:

hashSet.remove(100_000 - i);

The algorithm will become dramatically faster!

like image 119
apangin Avatar answered Oct 13 '22 02:10

apangin