I have a list of 1 million objects that I need to populate into a Map. To reduce the time this takes, I am planning to use Java 8's parallelStream() like this:
List<Person> list = new LinkedList<>();
Map<String, String> map = new HashMap<>();
list.parallelStream().forEach(person -> {
    map.put(person.getName(), person.getAge());
});
Is it safe to populate a Map like this from parallel threads? Isn't it possible to have concurrency issues, so that some data gets lost from the Map?
An operation on a parallel stream is still blocking and will wait for all the threads it spawned to finish. Those threads execute asynchronously (they don't wait for a previous one to finish), but that doesn't mean your whole code starts behaving asynchronously!
A sequential stream is executed in a single thread running on one CPU core. The elements in the stream are processed sequentially in a single pass by the stream operations that are executed in the same thread. A parallel stream is executed by different threads, running on multiple CPU cores in a computer.
Hence, even a stream with very many elements can take a performance hit from the overhead of splitting the work and merging the results. Also, lambdas that produce side effects make parallel streams hazardous to thread safety.
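To make the sequential-vs-parallel distinction concrete, here is a small sketch (class name ThreadDemo and the element count are my own) that records which threads actually process the elements of a parallel stream:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

public class ThreadDemo {
    // Returns the names of every thread that processed at least one element.
    static Set<String> workerThreads() {
        // A concurrent set, since it is written to from multiple threads.
        Set<String> threads = ConcurrentHashMap.newKeySet();
        IntStream.range(0, 100_000).parallel()
                 .forEach(i -> threads.add(Thread.currentThread().getName()));
        return threads;
    }

    public static void main(String[] args) {
        // On a multi-core machine this typically prints the calling thread
        // plus several ForkJoinPool.commonPool worker threads.
        System.out.println(ThreadDemo.workerThreads());
    }
}
```

With .parallel() removed, the set would contain only the calling thread's name.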
1. Parallel streams can actually slow you down. Java 8 brought the promise of parallelism as one of its most anticipated new features, but that promise does not automatically mean faster code.
It is very safe to use parallelStream() to collect into a HashMap. However, it is not safe to combine parallelStream(), forEach and a consumer adding things to a HashMap.
HashMap is not a synchronized class, and trying to put elements into it concurrently will not work properly. That is exactly what forEach does here: it invokes the given consumer, which puts elements into the HashMap, from multiple threads, possibly at the same time. If you want simple code demonstrating the issue:
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

List<Integer> list = IntStream.range(0, 10000).boxed().collect(Collectors.toList());
Map<Integer, Integer> map = new HashMap<>();
list.parallelStream().forEach(i -> {
    map.put(i, i);
});
System.out.println(list.size());
System.out.println(map.size());
Make sure to run it a couple of times. There's a very good chance (the joy of concurrency) that the printed map size after the operation is not 10000, which is the size of the list, but slightly less.
The solution here, as always, is not to use forEach, but to use a mutable reduction approach with the collect method and the built-in toMap collector:
Map<Integer, Integer> map = list.parallelStream().collect(Collectors.toMap(i -> i, i -> i));
Use that line of code in the sample above, and you can rest assured that the map size will always be 10000. The Stream API guarantees that it is safe to collect into a non-thread-safe container, even in parallel. This also means that you don't need toConcurrentMap to be safe: that collector is only needed if you specifically want a ConcurrentMap as the result rather than a general Map. As far as thread safety with collect is concerned, you can use either.
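One caveat the answer doesn't mention: the two-argument form of toMap throws IllegalStateException when two elements map to the same key. A sketch of the three-argument form with a merge function (the class name DuplicateKeys and the last-digit grouping are my own illustration):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DuplicateKeys {
    // Maps each value to its last digit; colliding values are summed.
    static Map<Integer, Integer> sumByLastDigit(List<Integer> values) {
        // The third argument is the merge function; without it, toMap
        // would throw IllegalStateException on the first duplicate key.
        return values.parallelStream()
                     .collect(Collectors.toMap(i -> i % 10, i -> i, Integer::sum));
    }

    public static void main(String[] args) {
        // 1 and 11 collide on key 1, and 2 and 22 collide on key 2,
        // so their values are summed per key.
        System.out.println(sumByLastDigit(List.of(1, 11, 2, 22)));
    }
}
```

The merge function would matter for the 25%- and 100%-duplicate inputs mentioned further below.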
HashMap isn't thread-safe, but ConcurrentHashMap is; use that instead:
Map<String, String> map = new ConcurrentHashMap<>();
and your code will work as expected.
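Put together, the forEach approach is safe once the target map is concurrent. A self-contained sketch (class name ConcurrentPut is mine) that mirrors the earlier demo but never loses updates:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

public class ConcurrentPut {
    // Fills a ConcurrentHashMap from a parallel stream via forEach.
    static Map<Integer, Integer> fill(int n) {
        Map<Integer, Integer> map = new ConcurrentHashMap<>();
        // Safe: ConcurrentHashMap.put is thread-safe, so no updates are
        // lost even though the lambda runs on multiple worker threads.
        IntStream.range(0, n).parallel().forEach(i -> map.put(i, i));
        return map;
    }

    public static void main(String[] args) {
        System.out.println(fill(10_000).size()); // always prints 10000
    }
}
```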
forEach() vs toMap()
After JVM warm-up, with 1M elements, using parallel streams and median timings, the forEach() version was consistently 2-3 times faster than the toMap() version. Results were consistent between all-unique, 25%-duplicate and 100%-duplicate inputs.
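The benchmark methodology isn't shown in that answer; a minimal sketch of how such a comparison might be run (class name MapBench, element count, and run count are my own choices, and absolute timings will vary by machine):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MapBench {
    // Times one execution of the given map-building task, in milliseconds.
    static long timeMillis(Supplier<Map<Integer, Integer>> task) {
        long start = System.nanoTime();
        task.get();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // Repeat several times; only the later runs are meaningful,
        // once the JIT has compiled the hot paths (warm-up).
        for (int run = 0; run < 5; run++) {
            long forEachMs = timeMillis(() -> {
                Map<Integer, Integer> m = new ConcurrentHashMap<>();
                IntStream.range(0, n).parallel().forEach(i -> m.put(i, i));
                return m;
            });
            long toMapMs = timeMillis(() ->
                IntStream.range(0, n).boxed().parallel()
                         .collect(Collectors.toMap(i -> i, i -> i)));
            System.out.println("forEach: " + forEachMs + " ms, toMap: " + toMapMs + " ms");
        }
    }
}
```

For serious measurements a harness such as JMH would be more reliable than hand-rolled timing like this.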