Performance of Guava's ImmutableSet.contains

Tags:

Guava's ImmutableSet seems to perform quite poorly in my benchmark concerning contains. For some sizes it gets even much slower than List:

  size            benchmark         ns linear runtime
100000         ListContains  110279.54 ==
100000          SetContains       7.15 =
100000 ImmutableSetContains   76716.47 =
200000         ListContains  275367.66 =====
200000          SetContains       7.34 =
200000 ImmutableSetContains  322185.50 ======
500000         ListContains  935210.10 ====================
500000          SetContains       7.79 =
500000 ImmutableSetContains 1382765.76 ==============================

Basically, I fill a set with a few thousands negative integers and test contains with non-negative ones. The code is trivial but a bit too long for pasting in a small text area, so please look here.

I wonder what's going on here. Probably, I hit some degenerate case, although I obviously didn't try to. Or maybe I've just blew the benchmark. Otherwise, I wonder if it can and should be fixed.

The solution was to change the smearing function by replacing

hashCode ^= (hashCode >>> 20) ^ (hashCode >>> 12);
return hashCode ^ (hashCode >>> 7) ^ (hashCode >>> 4);

return C2 * Integer.rotateLeft(hashCode * C1, 15);

This takes about the same time and might have some disadvantage, but solves the current problem by nicely spreading the hashes.

348

asked Sep 24 '12 21:09

maaartinus

1 Answers

The explanation is actually very simple. Imagine the integers lie in the set M = {0, 1, ..., N-1} for some N, which is a power of two. And so do the hashes, because of how Integer.hashCode is defined. The hashes get processed via a function smear identical to this one in order to minimize collision in the lowest bits in some common cases.

A table of size 2*N gets allocated and the members of M get put into it. As smear involves only right shifts, it maps M onto itself, which means that a continuous range of the table gets filled. So lets say that all slots in the left half of the table are used and the other half is unused.

When contains(o) gets invoked, the search starts in a slot which position is determined by o.hashCode(). If o gets found, the result is true, if an empty slot gets hit, the result is false. Otherwise, the search proceeds to another slot. In order to minimize cache misses, linear probing gets used.

When we are unlucky enough to start the search at the first used slot, all of them have to be traversed, which means N steps. Starting at a random position means N/4 steps on the average.

What happened in my benchmark is not exactly as above, but the reason for its bad performance is the same. Making the sizes to powers of two makes the problem worse.

answered Nov 23 '22 08:11

maaartinus

Related questions
                            
                                Hibernate Set Or List
                            
                                Efficient Way of making a set of tuple in which the order of tuple doesn't matters
                            
                                How to insert into STL set?
                            
                                How do you determine if you can draw a circle around a set of points, such that the points from the other set aren't inside it?
                            
                                Why does deduction fail for std::set in GCC?
                            
                                Is it worth making get and set methods in OOP?
                            
                                "add to set" returns a boolean in java - what about python?
                            
                                Get unique values from pandas series of lists
                            
                                Why Elixir's MapSet becomes unordered after 32 elements?
                            
                                Swift Set of Structure Types
                            
                                MongoDb - How can I update multiple elements of a nested object using $set?
                            
                                How to get random element from a set in Swift?
                            
                                Convert Set<Integer> to Set<String> in Java
                            
                                making an array of sets in python
                            
                                How to properly check if a Set is empty
                            
                                Optimized implementations of java.util.Map and java.util.Set?
                            
                                How to get the difference between two maps Java?
                            
                                Is it valid to reduce on an empty set of sets?
                            
                                ast.literal_eval() support for set literals in Python 2.7?
                            
                                Pairwise Set Intersection in Python

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Performance of Guava's ImmutableSet.contains

Tags:

set

immutability

guava

caliper

maaartinus

People also ask

1 Answers

maaartinus

Recent Activity

Donate For Us