Parallel streams, collectors and thread safety

Tags: java, concurrency, parallel-processing, java-8, java-stream

See the simple example below that counts the number of occurrences of each word in a list:

Stream<String> words = Stream.of("a", "b", "a", "c");
Map<String, Integer> wordsCount = words.collect(toMap(s -> s, s -> 1,
                                                      (i, j) -> i + j));

At the end, wordsCount is {a=2, b=1, c=1}.

But my stream is very large and I want to parallelise the job, so I write:

Map<String, Integer> wordsCount = words.parallel()
                                       .collect(toMap(s -> s, s -> 1,
                                                      (i, j) -> i + j));

However, I have noticed that wordsCount is a plain HashMap, so I wonder whether I need to explicitly ask for a concurrent map to ensure thread safety:

Map<String, Integer> wordsCount = words.parallel()
                                       .collect(toConcurrentMap(s -> s, s -> 1,
                                                                (i, j) -> i + j));

Can non-concurrent collectors be safely used with a parallel stream or should I only use the concurrent versions when collecting from a parallel stream?


asked Mar 12 '14 11:03

assylias

2 Answers

Can non-concurrent collectors be safely used with a parallel stream or should I only use the concurrent versions when collecting from a parallel stream?

It is safe to use a non-concurrent collector in a collect operation of a parallel stream.

In the specification of the Collector interface, in the section with half a dozen bullet points, is this:

For non-concurrent collectors, any result returned from the result supplier, accumulator, or combiner functions must be serially thread-confined. This enables collection to occur in parallel without the Collector needing to implement any additional synchronization. The reduction implementation must manage that the input is properly partitioned, that partitions are processed in isolation, and combining happens only after accumulation is complete.

This means that the various implementations provided by the Collectors class can be used with parallel streams, even though some of those implementations might not be concurrent collectors. This also applies to any of your own non-concurrent collectors that you might implement. They can be used safely with parallel streams, provided your collectors don't interfere with the stream source, are side-effect free, order independent, etc.
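As a minimal sketch of that claim, the word count from the question can be run sequentially and in parallel with the same ordinary toMap collector and the results compared. The class name, the method name, and the sample data are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ParallelToMapSketch {
    // Counts words with the ordinary (non-concurrent) toMap collector.
    // The only difference between the two modes is the stream itself.
    static Map<String, Integer> count(List<String> words, boolean parallel) {
        return (parallel ? words.parallelStream() : words.stream())
                .collect(Collectors.toMap(s -> s, s -> 1, (i, j) -> i + j));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c");
        Map<String, Integer> sequential = count(words, false);
        Map<String, Integer> parallel = count(words, true);
        // Both maps hold the same counts; no concurrent map was requested.
        System.out.println(sequential.equals(parallel));
    }
}
```

Nothing in the collector changes between the two calls, which is the point: the stream framework, not the collector, is responsible for keeping each intermediate map confined to one thread at a time.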

I also recommend reading the Mutable Reduction section of the java.util.stream package documentation. In the middle of this section is an example that is stated to be parallelizable, but which collects results into an ArrayList, which is not thread-safe.

The way this works is that a parallel stream ending in a non-concurrent collector makes sure that different threads are always operating on different instances of the intermediate result collections. That's why a collector has a Supplier function, for creating as many intermediate collections as there are threads, so each thread can accumulate into its own. When intermediate results are to be merged, they are handed off safely between threads, and at any given time only a single thread is merging any pair of intermediate results.
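One way to observe this, sketched under the assumption that counting supplier calls is a fair proxy for counting intermediate containers: the hypothetical collector below behaves like a plain list collector but increments a counter each time its supplier runs. All names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collector;
import java.util.stream.IntStream;

class ContainerCountSketch {
    // A non-concurrent list collector that counts how often its supplier runs.
    // The counter is only for observation; the containers are plain ArrayLists.
    static Collector<Integer, List<Integer>, List<Integer>> countingToList(AtomicInteger created) {
        return Collector.of(
                () -> {
                    created.incrementAndGet();
                    return new ArrayList<Integer>();
                },
                List::add,
                (left, right) -> {
                    left.addAll(right);
                    return left;
                });
    }

    // Runs the collector and returns how many intermediate lists were created.
    static int containersCreated(int n, boolean parallel) {
        AtomicInteger created = new AtomicInteger();
        IntStream source = IntStream.range(0, n);
        if (parallel) {
            source = source.parallel();
        }
        List<Integer> result = source.boxed().collect(countingToList(created));
        if (result.size() != n) {
            throw new IllegalStateException("lost elements: " + result.size());
        }
        return created.get();
    }

    public static void main(String[] args) {
        System.out.println("sequential containers: " + containersCreated(10_000, false));
        System.out.println("parallel containers:   " + containersCreated(10_000, true));
    }
}
```

A sequential run creates a single container. The parallel count is not fixed: it depends on how the source is split, which varies with the machine and the JDK, so the sketch prints it rather than assuming a value.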


answered Oct 08 '22 13:10

Stuart Marks

All collectors, if they follow the rules in the specification, are safe to run in parallel or sequential. Parallel-readiness is a key part of the design here.

The distinction between concurrent and non-concurrent collectors has to do with the approach to parallelization.

An ordinary (non-concurrent) collector operates by merging sub-results. So the source is partitioned into a bunch of chunks, each chunk is collected into a result container (like a list or a map), and then the sub-results are merged into a bigger result container. This is safe and order-preserving, but for some kinds of containers -- especially maps -- can be expensive, since merging two maps by key is often expensive.
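The order-preserving part can be checked with a small sketch. It assumes an ordered source (here an integer range) and the ordinary toList collector; the class and method names are illustrative.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class EncounterOrderSketch {
    // Collects 0..n-1 into a list with the ordinary toList collector.
    static List<Integer> collect(int n, boolean parallel) {
        IntStream source = IntStream.range(0, n);
        if (parallel) {
            source = source.parallel();
        }
        return source.boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The chunks are merged in encounter order, so both lists are equal.
        System.out.println(collect(100_000, true).equals(collect(100_000, false)));
    }
}
```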

A concurrent collector instead creates one result container, whose insertion operations are guaranteed to be thread-safe, and blasts elements into it from multiple threads. With a highly concurrent result container like ConcurrentHashMap, this approach may well perform better than merging ordinary HashMaps.

So, the concurrent collectors are strictly optimizations over their ordinary counterparts. And they don't come without a cost; because elements are being blasted in from many threads, concurrent collectors generally cannot preserve encounter order. (But, often you don't care -- when creating a word count histogram, you don't care which instance of "foo" you counted first.)
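For comparison, here is a hedged sketch of the concurrent variant of the word count. It also inspects the characteristics the two collectors report, which is how the stream implementation can tell them apart. The class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class ConcurrentCollectorSketch {
    // Word count with the concurrent collector: the threads share one map.
    static Map<String, Integer> countConcurrently(List<String> words) {
        return words.parallelStream()
                .collect(Collectors.toConcurrentMap(s -> s, s -> 1, (i, j) -> i + j));
    }

    // Characteristics reported by the concurrent collector.
    static Set<Collector.Characteristics> concurrentTraits() {
        Collector<String, ?, ConcurrentMap<String, Integer>> c =
                Collectors.toConcurrentMap(s -> s, s -> 1, (i, j) -> i + j);
        return c.characteristics();
    }

    // Characteristics reported by the ordinary collector.
    static Set<Collector.Characteristics> ordinaryTraits() {
        Collector<String, ?, Map<String, Integer>> c =
                Collectors.toMap(s -> s, s -> 1, (i, j) -> i + j);
        return c.characteristics();
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(Arrays.asList("a", "b", "a", "c")));
        System.out.println("toConcurrentMap: " + concurrentTraits());
        System.out.println("toMap:           " + ordinaryTraits());
    }
}
```

The counts come out the same either way. What differs is the strategy: the concurrent collector declares itself CONCURRENT and UNORDERED, while the ordinary one declares neither and is therefore run with per-chunk containers that get merged.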

answered Oct 08 '22 12:10

Brian Goetz
