Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance of Guava's ImmutableSet.contains

Guava's ImmutableSet seems to perform quite poorly in my benchmark concerning contains. For some sizes it gets even much slower than List:

  size            benchmark         ns linear runtime
100000         ListContains  110279.54 ==
100000          SetContains       7.15 =
100000 ImmutableSetContains   76716.47 =
200000         ListContains  275367.66 =====
200000          SetContains       7.34 =
200000 ImmutableSetContains  322185.50 ======
500000         ListContains  935210.10 ====================
500000          SetContains       7.79 =
500000 ImmutableSetContains 1382765.76 ==============================

Basically, I fill a set with a few thousands negative integers and test contains with non-negative ones. The code is trivial but a bit too long for pasting in a small text area, so please look here.

I wonder what's going on here. Probably, I hit some degenerate case, although I obviously didn't try to. Or maybe I've just blew the benchmark. Otherwise, I wonder if it can and should be fixed.


The solution was to change the smearing function by replacing

hashCode ^= (hashCode >>> 20) ^ (hashCode >>> 12);
return hashCode ^ (hashCode >>> 7) ^ (hashCode >>> 4);

by

return C2 * Integer.rotateLeft(hashCode * C1, 15);

This takes about the same time and might have some disadvantage, but solves the current problem by nicely spreading the hashes.

like image 348
maaartinus Avatar asked Sep 24 '12 21:09

maaartinus


People also ask

What is ImmutableSet?

What Is an Immutable Set? In general, an immutable object will not change its internal state once we create it. This makes it thread-safe by default. The same logic applies to immutable sets.

Is Guava ImmutableSet ordered?

Guava's ImmutableSet provides a high-performance, immutable Set with reliable, user-specified iteration order. There are also variations like ImmutableSortedSet.

Is immutable Set ordered?

In short, ImmutableSet uses Array + HashSet to implement the ordered Set. Because it is Immutable, there is no requirement to implement any modification methods.


1 Answers

The explanation is actually very simple. Imagine the integers lie in the set M = {0, 1, ..., N-1} for some N, which is a power of two. And so do the hashes, because of how Integer.hashCode is defined. The hashes get processed via a function smear identical to this one in order to minimize collision in the lowest bits in some common cases.

A table of size 2*N gets allocated and the members of M get put into it. As smear involves only right shifts, it maps M onto itself, which means that a continuous range of the table gets filled. So lets say that all slots in the left half of the table are used and the other half is unused.

When contains(o) gets invoked, the search starts in a slot which position is determined by o.hashCode(). If o gets found, the result is true, if an empty slot gets hit, the result is false. Otherwise, the search proceeds to another slot. In order to minimize cache misses, linear probing gets used.

When we are unlucky enough to start the search at the first used slot, all of them have to be traversed, which means N steps. Starting at a random position means N/4 steps on the average.

What happened in my benchmark is not exactly as above, but the reason for its bad performance is the same. Making the sizes to powers of two makes the problem worse.

like image 66
maaartinus Avatar answered Nov 23 '22 08:11

maaartinus