
Performance for Java Stream.concat VS Collection.addAll

For the purpose of combining two sets of data in a stream:

Stream.concat(stream1, stream2).collect(Collectors.toSet());

Or

stream1.collect(Collectors.toSet())
       .addAll(stream2.collect(Collectors.toSet()));

Which is more efficient and why?

asked Jan 12 '17 by alan7678


People also ask

Does Java Stream improve performance?

In Java 8 Streams, performance is achieved through parallelism, laziness, and short-circuit operations. There is a downside as well: we need to be cautious when choosing Streams, as careless use may degrade the performance of your application.

Which is faster foreach or stream in Java?

A parallel forEach() works on a multithreading concept: the only difference between stream().forEach() and the parallel forEach() is the multithreading in the parallel version, which can make it much faster than a for-each loop or stream().forEach(). A minimal sketch follows (the class name and sample values are illustrative):
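import java.util.List;

public class ForEachDemo {
  public static void main(String[] args) {
    List<Integer> nums = List.of(1, 2, 3, 4, 5);
    nums.stream().forEach(System.out::println);         // one thread, encounter order
    nums.parallelStream().forEach(System.out::println); // may use several threads, order not guaranteed
  }
}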

How do you combine two Collections in Java?

Using the concat() method: the static method Stream.concat() combines two Streams logically by creating a lazily concatenated Stream whose elements are all the elements of the first Stream followed by all the elements of the second Stream.

How do I combine two streams in Java?

Stream.concat() in Java: the Stream.concat() method creates a concatenated stream in which the elements are all the elements of the first stream followed by all the elements of the second stream. The resulting stream is ordered if both of the input streams are ordered, and parallel if either of the input streams is parallel. A minimal sketch of this behavior (the class name and sample values are illustrative):
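import java.util.stream.Stream;

public class ConcatDemo {
  public static void main(String[] args) {
    Stream<String> first = Stream.of("a", "b");
    Stream<String> second = Stream.of("c", "d");
    // Prints a, b, c, d: all elements of the first stream, then the second.
    Stream.concat(first, second).forEach(System.out::println);
  }
}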


2 Answers

For the sake of readability and intent, Stream.concat(a, b).collect(toSet()) is way clearer than the second alternative.

For the sake of the question, which is "what is the most efficient", here is a JMH test (I don't use JMH that much, so there may be some room to improve this benchmark):

package stackoverflow;

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@Warmup(iterations = 2)
@Fork(1)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode({ Mode.AverageTime})
public class StreamBenchmark {
  private Set<String> s1;
  private Set<String> s2;

  @Setup
  public void setUp() {
    // Build two disjoint sets of 1000 distinct strings each (2000 elements total).
    final Set<String> valuesForA = new HashSet<>();
    final Set<String> valuesForB = new HashSet<>();
    for (int i = 0; i < 1000; ++i) {
      valuesForA.add(Integer.toString(i));
      valuesForB.add(Integer.toString(1000 + i));
    }
    s1 = valuesForA;
    s2 = valuesForB;
  }

  // Variant 1: lazily concatenate both streams, then collect once into a set.
  @Benchmark
  public void stream_concat_then_collect_using_toSet(final Blackhole blackhole) {
    final Set<String> set = Stream.concat(s1.stream(), s2.stream()).collect(Collectors.toSet());
    blackhole.consume(set);
  }

  // Variant 2: collect the first stream into a set, then add all elements
  // of the second collected set into it.
  @Benchmark
  public void s1_collect_using_toSet_then_addAll_using_toSet(final Blackhole blackhole) {
    final Set<String> set = s1.stream().collect(Collectors.toSet());
    set.addAll(s2.stream().collect(Collectors.toSet()));
    blackhole.consume(set);
  }
}
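For completeness, one way to run such a benchmark is through JMH's Runner API; this is a sketch (the BenchmarkRunner class name is mine, and it assumes the JMH core and annotation-processor artifacts are on the classpath; the jmh Maven plugin is the usual alternative):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkRunner {
  public static void main(String[] args) throws RunnerException {
    Options options = new OptionsBuilder()
        .include(StreamBenchmark.class.getSimpleName())
        .build();
    new Runner(options).run();
  }
}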

You get these results (I omitted some parts for readability):

Result "s1_collect_using_toSet_then_addAll_using_toSet":
  156969,172 ±(99.9%) 4463,129 ns/op [Average]
  (min, avg, max) = (152842,561, 156969,172, 161444,532), stdev = 2952,084
  CI (99.9%): [152506,043, 161432,301] (assumes normal distribution)

Result "stream_concat_then_collect_using_toSet":
  104254,566 ±(99.9%) 4318,123 ns/op [Average]
  (min, avg, max) = (102086,234, 104254,566, 111731,085), stdev = 2856,171
  CI (99.9%): [99936,443, 108572,689] (assumes normal distribution)

# Run complete. Total time: 00:00:25

Benchmark                                                       Mode  Cnt       Score      Error  Units
StreamBenchmark.s1_collect_using_toSet_then_addAll_using_toSet  avgt   10  156969,172 ± 4463,129  ns/op
StreamBenchmark.stream_concat_then_collect_using_toSet          avgt   10  104254,566 ± 4318,123  ns/op

The version using Stream.concat(a, b).collect(toSet()) should perform faster (if I read the JMH numbers correctly).

On the other hand, I think this result makes sense: you don't create an intermediate set (which has some cost, even with HashSet), and, as said in a comment on the first answer, the Stream is lazily concatenated.

Using a profiler, you might see which part is slower. You might also want to use toCollection(() -> new HashSet<>(1000)) instead of toSet() to see whether the problem lies in growing the HashSet's internal hash array.
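As a sketch of that presizing idea (the class and helper names are mine): note that to hold all 2000 combined elements without resizing, the capacity must account for HashSet's default 0.75 load factor, i.e. at least 2000 / 0.75 ≈ 2667.

import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PresizedConcat {
  // Collect into a presized HashSet so its internal array never has to grow.
  static Set<String> concatPresized(Set<String> s1, Set<String> s2) {
    return Stream.concat(s1.stream(), s2.stream())
        .collect(Collectors.toCollection(() -> new HashSet<>(2667)));
  }
}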

answered Nov 23 '22 by NoDataFound


First of all, it must be emphasized that the second variant is incorrect. The toSet() collector returns a Set with “no guarantees on the type, mutability, serializability, or thread-safety”. If mutability is not guaranteed, it is not correct to invoke addAll on the resulting Set.

It happens to work with the current version of the reference implementation, where a HashSet will be created, but might stop working in a future version or alternative implementations. In order to fix this, you have to replace toSet() with toCollection(HashSet::new) for the first Stream’s collect operation.
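Applied to the snippets from the question, a corrected second variant would look like this (the class and helper names are mine; stream1 and stream2 are as in the question):

import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MutableCombine {
  // toCollection(HashSet::new) guarantees a mutable HashSet, so the
  // subsequent addAll call is safe on every conforming implementation.
  static Set<String> combine(Stream<String> stream1, Stream<String> stream2) {
    Set<String> set = stream1.collect(Collectors.toCollection(HashSet::new));
    set.addAll(stream2.collect(Collectors.toSet()));
    return set;
  }
}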

This leads to the situation that the second variant is not only less efficient with the current implementation, as shown in this answer, it might also prevent future optimizations made to the toSet() collector, by insisting on the result being of the exact type HashSet. Also, unlike the toSet() collector, the toCollection(…) collector has no way of detecting that the target collection is unordered, which might have a performance relevance in future implementations.

answered Nov 23 '22 by Holger