For the purpose of combining two sets of data in a stream, there are two options:
Stream.concat(stream1, stream2).collect(Collectors.toSet());
or
stream1.collect(Collectors.toSet())
    .addAll(stream2.collect(Collectors.toSet()));
Which is more efficient, and why?
Using the concat() method: the static method Stream.concat() combines two streams logically by creating a lazily concatenated stream whose elements are all the elements of the first stream followed by all the elements of the second stream. The resulting stream is ordered if both of the input streams are ordered, and parallel if either of the input streams is parallel.
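As a small, made-up illustration of those rules (the values and variable names here are just for the example, not from the question; the usual imports java.util.List, java.util.stream.Collectors and java.util.stream.Stream are assumed):
// Both inputs are ordered, so the result keeps their order: first all of a, then all of b.
List<String> ordered = Stream.concat(Stream.of("a", "b"), Stream.of("c", "d"))
        .collect(Collectors.toList());
System.out.println(ordered); // [a, b, c, d]

// If either input is parallel, the concatenated stream is parallel as well.
boolean parallel = Stream.concat(Stream.of("a", "b"), Stream.of("c", "d").parallel())
        .isParallel();
System.out.println(parallel); // true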
For the sake of readability and intent, Stream.concat(a, b).collect(toSet()) is far clearer than the second alternative.
As for the question itself, which is "what is the most efficient", here is a JMH benchmark (I don't use JMH that much, so there may be room to improve the test):
package stackoverflow;

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@Warmup(iterations = 2)
@Fork(1)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode({ Mode.AverageTime })
public class StreamBenchmark {

    private Set<String> s1;
    private Set<String> s2;

    @Setup
    public void setUp() {
        final Set<String> valuesForA = new HashSet<>();
        final Set<String> valuesForB = new HashSet<>();
        for (int i = 0; i < 1000; ++i) {
            valuesForA.add(Integer.toString(i));
            valuesForB.add(Integer.toString(1000 + i));
        }
        s1 = valuesForA;
        s2 = valuesForB;
    }

    @Benchmark
    public void stream_concat_then_collect_using_toSet(final Blackhole blackhole) {
        final Set<String> set = Stream.concat(s1.stream(), s2.stream()).collect(Collectors.toSet());
        blackhole.consume(set);
    }

    @Benchmark
    public void s1_collect_using_toSet_then_addAll_using_toSet(final Blackhole blackhole) {
        final Set<String> set = s1.stream().collect(Collectors.toSet());
        set.addAll(s2.stream().collect(Collectors.toSet()));
        blackhole.consume(set);
    }
}
You get these results (I omitted some parts for readability):
Result "s1_collect_using_toSet_then_addAll_using_toSet":
156969,172 ±(99.9%) 4463,129 ns/op [Average]
(min, avg, max) = (152842,561, 156969,172, 161444,532), stdev = 2952,084
CI (99.9%): [152506,043, 161432,301] (assumes normal distribution)
Result "stream_concat_then_collect_using_toSet":
104254,566 ±(99.9%) 4318,123 ns/op [Average]
(min, avg, max) = (102086,234, 104254,566, 111731,085), stdev = 2856,171
CI (99.9%): [99936,443, 108572,689] (assumes normal distribution)
# Run complete. Total time: 00:00:25
Benchmark Mode Cnt Score Error Units
StreamBenchmark.s1_collect_using_toSet_then_addAll_using_toSet avgt 10 156969,172 ± 4463,129 ns/op
StreamBenchmark.stream_concat_then_collect_using_toSet avgt 10 104254,566 ± 4318,123 ns/op
The version using Stream.concat(a, b).collect(toSet()) should perform faster (if I am reading the JMH numbers correctly).
On the other hand, I think this result is expected, because you don't create an intermediate set (which has some cost, even with a HashSet), and, as said in the comment on the first answer, the Stream is lazily concatenated.
Using a profiler you might see in which part it is slower. You might also want to use toCollection(() -> new HashSet<>(1000)) instead of toSet() to see if the problem lies in growing the HashSet's internal hash array.
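For example, here is a sketch of one such variant for the slower case, which could be added to the StreamBenchmark class above (the method name and the capacity of 4096 are illustrative assumptions, not part of the original benchmark):
@Benchmark
public void s1_collect_using_presized_hashSet_then_addAll(final Blackhole blackhole) {
    // Pre-size the target set so the subsequent addAll of s2 does not force a resize.
    // With the default load factor of 0.75, a capacity of 4096 holds the ~2000 elements.
    final Set<String> set = s1.stream()
            .collect(Collectors.toCollection(() -> new HashSet<>(4096)));
    set.addAll(s2.stream().collect(Collectors.toSet()));
    blackhole.consume(set);
}
If its score moves close to the concat variant, most of the extra cost comes from growing the hash table; if not, building the intermediate set itself is the main cost.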
First of all, it must be emphasized that the second variant is incorrect. The toSet() collector returns a Set with “no guarantees on the type, mutability, serializability, or thread-safety”. If mutability is not guaranteed, it is not correct to invoke addAll on the resulting Set.
It happens to work with the current version of the reference implementation, where a HashSet will be created, but it might stop working in a future version or with alternative implementations. In order to fix this, you have to replace toSet() with toCollection(HashSet::new) for the first Stream’s collect operation.
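A minimal sketch of the corrected second variant (assuming stream1 and stream2 from the question are Stream<String>, as a placeholder element type):
// The first collect now guarantees a mutable HashSet, so calling addAll on it is safe.
Set<String> result = stream1.collect(Collectors.toCollection(HashSet::new));
result.addAll(stream2.collect(Collectors.toSet()));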
This leads to the situation that the second variant is not only less efficient with the current implementation, as shown in this answer; it might also prevent future optimizations made to the toSet() collector, by insisting on the result being of the exact type HashSet. Also, unlike the toSet() collector, the toCollection(…) collector has no way of detecting that the target collection is unordered, which might be relevant for performance in future implementations.
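That difference shows up in the collector characteristics. A quick check (the exact output is an implementation detail; the values shown are what the current OpenJDK reference implementation reports):
System.out.println(Collectors.toSet().characteristics());
// [UNORDERED, IDENTITY_FINISH]
System.out.println(Collectors.toCollection(HashSet::new).characteristics());
// [IDENTITY_FINISH]  (the collector cannot know that the supplied collection is unordered)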