Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ImmutableCollections SetN implementation detail

I have sort of a hard time understanding an implementation detail from java-9 ImmutableCollections.SetN; specifically why is there a need to increase the inner array twice.

Suppose you do this:

Set.of(1,2,3,4) // 4 elements, but internal array is 8

More exactly I perfectly understand why this is done (a double expansion) in case of a HashMap - where you never (almost) want the load_factor to be one. A value of !=1 improves search time as entries are better dispersed to buckets for example.

But in case of an immutable Set - I can't really tell. Especially since the way an index of the internal array is chosen.

Let me provide some details. First how the index is searched:

 int idx = Math.floorMod(pe.hashCode() ^ SALT, elements.length);

pe is the actual value we put in the set. SALT is just 32 bits generated at start-up, once per JVM (this is the actual randomization if you want). elements.length for our example is 8 (4 elements, but 8 here - double the size).

This expression is like a negative-safe modulo operation. Notice that the same logical thing is done in HashMap for example ((n - 1) & hash) when the bucket is chosen.

So if elements.length is 8 for our case, then this expression will return any positive value that is less than 8 (0, 1, 2, 3, 4, 5, 6, 7).

Now the rest of the method:

 while (true) {
        E ee = elements[idx];
        if (ee == null) {
            return -idx - 1;
        } else if (pe.equals(ee)) {
            return idx;
        } else if (++idx == elements.length) {
            idx = 0;
        }
    }

Let's break it down:

if (ee == null) {
    return -idx - 1;

This is good, it means that the current slot in the array is empty - we can put our value there.

} else if (pe.equals(ee)) {
    return idx;

This is bad - slot is occupied and the already in place entry is equal to the one we want to put. Sets can't have duplicate elements - so an Exception is later thrown.

 else if (++idx == elements.length) {
      idx = 0;
 }

This means that this slot is occupied (hash collision), but elements are not equal. In a HashMap this entry would be put to the same bucket as a LinkedNode or TreeNode - but not the case here.

So index is incremented and the next position is tried (with the small caveat that it moves in a circular way when it reaches the last position).

And here is the question: if nothing too fancy (unless I'm missing something) is being done when searching the index, why is there a need to have an array twice as big? Or why the function was not written like this:

int idx = Math.floorMod(pe.hashCode() ^ SALT, input.length);

// notice the diff elements.length (8) and not input.length (4)
like image 716
Eugene Avatar asked Jul 27 '17 13:07

Eugene


1 Answers

The current implementation of SetN is a fairly simple closed hashing scheme, as opposed to the separate chaining approach used by HashMap. ("Closed hashing" is also confusingly known as "open addressing".) In a closed hashing scheme, elements are stored in the table itself, instead of being stored in a list or tree of elements that are linked from each table slot, which is separate chaining.

This implies that if two different elements hash to the same table slot, this collision needs to be resolved by finding another slot for one of the elements. The current SetN implementation resolves this using linear probing, where the table slots are checked sequentially (wrapping around at the end) until an open slot is found.

If you want to store N elements, they'll certainly fit into a table of size N. You can always find any element that's in the set, though you might have to probe several (or many) successive table slots to find it, because there will be lots of collisions. But if the set is probed for an object that's not a member, linear probing will have to check every table slot before it can determine that object isn't a member. With a full table, most probe operations will degrade to O(N) time, whereas the goal of most hash-based approaches is for operations to be O(1) time.

Thus we have a class space-time tradeoff. If we make the table larger, there will be empty slots sprinkled throughout the table. When storing items, there should be fewer collisions, and linear probing will find empty slots more quickly. The clusters of full slots next to each other will be smaller. Probes for non-members will proceed more quickly, since they're more likely to encounter an empty slot sooner while probing linearly -- possibly after not having to reprobe at all.

In bringing up the implementation, we ran a bunch of benchmarks using different expansion factors. (I used the term EXPAND_FACTOR in the code whereas most literature uses load factor. The reason is that the expansion factor is the reciprocal of the load factor, as used in HashMap, and using "load factor" for both meanings would be confusing.) When the expansion factor was near 1.0, the probe performance was quite slow, as expected. It improved considerably as the expansion factor was increased. The improvement was really flattening out by the time it got up to 3.0 or 4.0. We chose 2.0 since it got most of the performance improvement (close to O(1) time) while providing good space savings compared to HashSet. (Sorry, we haven't published these benchmark numbers anywhere.)

Of course, all of these are implementation specifics and may change from one release to the next, as we find better ways to optimize the system. I'm certain there are ways to improve the current implementation. (And fortunately we don't have to worry about preserving iteration order when we do this.)

A good discussion of open addressing and performance tradeoffs with load factors can be found in section 3.4 of

Sedgewick, Robert and Kevin Wayne. Algorithms, Fourth Edition. Addison-Wesley, 2011.

The online book site is here but note that the print edition has much more detail.

like image 88
Stuart Marks Avatar answered Nov 17 '22 11:11

Stuart Marks