For the purpose of combining two sets of data in a stream, there are two options:
Stream.concat(stream1, stream2).collect(Collectors.toSet());
or
stream1.collect(Collectors.toSet())
    .addAll(stream2.collect(Collectors.toSet()));
Which is more efficient, and why?
Using the concat() method: the static method Stream.concat() combines two streams logically by creating a lazily concatenated stream whose elements are all the elements of the first stream followed by all the elements of the second stream. The resulting stream is ordered if both of the input streams are ordered, and parallel if either of the input streams is parallel.
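As a small, made-up illustration of those rules (the values and variable names here are just for the example, not from the question; the usual imports java.util.List, java.util.stream.Collectors and java.util.stream.Stream are assumed):
// Both inputs are ordered, so the result keeps their order: first all of a, then all of b.
List<String> ordered = Stream.concat(Stream.of("a", "b"), Stream.of("c", "d"))
        .collect(Collectors.toList());
System.out.println(ordered); // [a, b, c, d]

// If either input is parallel, the concatenated stream is parallel as well.
boolean parallel = Stream.concat(Stream.of("a", "b"), Stream.of("c", "d").parallel())
        .isParallel();
System.out.println(parallel); // true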
For the sake of readability and intent, Stream.concat(a, b).collect(toSet()) is far clearer than the second alternative.
As for the question itself, which is "what is the most efficient", here is a JMH benchmark (I don't use JMH that much, so there may be room to improve the test):
package stackoverflow;

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@Warmup(iterations = 2)
@Fork(1)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode({ Mode.AverageTime })
public class StreamBenchmark {

    private Set<String> s1;
    private Set<String> s2;

    @Setup
    public void setUp() {
        final Set<String> valuesForA = new HashSet<>();
        final Set<String> valuesForB = new HashSet<>();
        for (int i = 0; i < 1000; ++i) {
            valuesForA.add(Integer.toString(i));
            valuesForB.add(Integer.toString(1000 + i));
        }
        s1 = valuesForA;
        s2 = valuesForB;
    }

    @Benchmark
    public void stream_concat_then_collect_using_toSet(final Blackhole blackhole) {
        final Set<String> set = Stream.concat(s1.stream(), s2.stream()).collect(Collectors.toSet());
        blackhole.consume(set);
    }

    @Benchmark
    public void s1_collect_using_toSet_then_addAll_using_toSet(final Blackhole blackhole) {
        final Set<String> set = s1.stream().collect(Collectors.toSet());
        set.addAll(s2.stream().collect(Collectors.toSet()));
        blackhole.consume(set);
    }
}
You get these results (I omitted some parts for readability):
Result "s1_collect_using_toSet_then_addAll_using_toSet":
156969,172 ±(99.9%) 4463,129 ns/op [Average]
(min, avg, max) = (152842,561, 156969,172, 161444,532), stdev = 2952,084
CI (99.9%): [152506,043, 161432,301] (assumes normal distribution)
Result "stream_concat_then_collect_using_toSet":
104254,566 ±(99.9%) 4318,123 ns/op [Average]
(min, avg, max) = (102086,234, 104254,566, 111731,085), stdev = 2856,171
CI (99.9%): [99936,443, 108572,689] (assumes normal distribution)
# Run complete. Total time: 00:00:25
Benchmark Mode Cnt Score Error Units
StreamBenchmark.s1_collect_using_toSet_then_addAll_using_toSet avgt 10 156969,172 ± 4463,129 ns/op
StreamBenchmark.stream_concat_then_collect_using_toSet avgt 10 104254,566 ± 4318,123 ns/op
The version using Stream.concat(a, b).collect(toSet()) should perform faster (if I am reading the JMH numbers correctly).
On the other hand, I think this result is expected, because you don't create an intermediate set (which has some cost, even with a HashSet), and, as said in the comment on the first answer, the Stream is lazily concatenated.
Using a profiler you might see in which part it is slower. You might also want to use toCollection(() -> new HashSet<>(1000)) instead of toSet() to see if the problem lies in growing the HashSet's internal hash array.
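For example, here is a sketch of one such variant for the slower case, which could be added to the StreamBenchmark class above (the method name and the capacity of 4096 are illustrative assumptions, not part of the original benchmark):
@Benchmark
public void s1_collect_using_presized_hashSet_then_addAll(final Blackhole blackhole) {
    // Pre-size the target set so the subsequent addAll of s2 does not force a resize.
    // With the default load factor of 0.75, a capacity of 4096 holds the ~2000 elements.
    final Set<String> set = s1.stream()
            .collect(Collectors.toCollection(() -> new HashSet<>(4096)));
    set.addAll(s2.stream().collect(Collectors.toSet()));
    blackhole.consume(set);
}
If its score moves close to the concat variant, most of the extra cost comes from growing the hash table; if not, building the intermediate set itself is the main cost.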
First of all, it must be emphasized that the second variant is incorrect. The toSet() collector returns a Set with “no guarantees on the type, mutability, serializability, or thread-safety”. If mutability is not guaranteed, it is not correct to invoke addAll on the resulting Set.
It happens to work with the current version of the reference implementation, where a HashSet will be created, but it might stop working in a future version or with alternative implementations. In order to fix this, you have to replace toSet() with toCollection(HashSet::new) for the first Stream’s collect operation.
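A minimal sketch of the corrected second variant (assuming stream1 and stream2 from the question are Stream<String>, as a placeholder element type):
// The first collect now guarantees a mutable HashSet, so calling addAll on it is safe.
Set<String> result = stream1.collect(Collectors.toCollection(HashSet::new));
result.addAll(stream2.collect(Collectors.toSet()));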
This leads to the situation that the second variant is not only less efficient with the current implementation, as shown in this answer; it might also prevent future optimizations made to the toSet() collector, by insisting on the result being of the exact type HashSet. Also, unlike the toSet() collector, the toCollection(…) collector has no way of detecting that the target collection is unordered, which might be relevant for performance in future implementations.
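That difference shows up in the collector characteristics. A quick check (the exact output is an implementation detail; the values shown are what the current OpenJDK reference implementation reports):
System.out.println(Collectors.toSet().characteristics());
// [UNORDERED, IDENTITY_FINISH]
System.out.println(Collectors.toCollection(HashSet::new).characteristics());
// [IDENTITY_FINISH]  (the collector cannot know that the supplied collection is unordered)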