Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Collectors.toSet implementation detail

I was looking at Collectors.toSet implementation under jdk-8 and pretty much saw the obvious thing:

 public static <T> Collector<T, ?, Set<T>> toSet() {
    return new CollectorImpl<>(
       (Supplier<Set<T>>) HashSet::new, 
       Set::add,
       (left, right) -> { left.addAll(right); return left; }, // combiner
       CH_UNORDERED_ID);

Look at the combiner for a moment; this has been discussed before here, but the idea is that a combiner folds from the second argument into the first. And that obviously happens here.

But then I looked into jdk-9 implementation and saw this:

 public static <T> Collector<T, ?, Set<T>> toSet() {
    return new CollectorImpl<>(
       (Supplier<Set<T>>) HashSet::new, 
       Set::add,
       (left, right) -> {
          if (left.size() < right.size()) {
            right.addAll(left); return right;
          } else {
             left.addAll(right); return left;
          }
       },
       CH_UNORDERED_ID);

Now why this happens is a bit obvious - it takes less time to add less elements to a bigger Set, then the other way around. But is that really cheaper than plain addAll, consider the extra overhead for the branch too?

Also this breaks my law about always folding left...

Can someone shed some light here?

like image 658
Eugene Avatar asked May 03 '17 18:05

Eugene


1 Answers

The combiner function of a Collector will receive the left and right appropriately, if there is an encounter order to maintain, however, it is up to the Collector, how it will actually combine these two arguments.

The documentation states:

A function that accepts two partial results and merges them. The combiner function may fold state from one argument into the other and return that, or may return a new result container.

For collecting into a List, it would be disastrous if we just swap left.addAll(right) to right.addAll(left), but for an unordered Set, it doesn’t matter. The toSet() collector even reports the UNORDERED characteristic to hint to the Stream (or any client code) that it won’t even matter which argument is provided as left or right, so a parallel stream could combine arbitrary partial results, whatever has finished first, in other words, it may behave like an unordered stream, even if the source had an encounter order (Java 8’s implementation doesn’t use that opportunity).

Regarding whether it is worth it… We are comparing a single additional branch with potentially thousands of add operations we can save, each of them bearing multiple conditional branches internally…

like image 162
Holger Avatar answered Oct 17 '22 23:10

Holger