The Java Stream API offers a general reduce(identity, accumulator) method. It is pretty clear from the javadocs that the accumulator should be a stateless function. However, I have a question about the identity object, namely: should it be thread-safe?
Let's say the identity is a Java object and the accumulator modifies that object in a way that is not atomic, e.g. the accumulator looks into the identity's state and then decides how exactly to modify its internal state. Clearly, several reduce operations may run at the same time. In this case several questions arise:

Should reduce be provided with a thread-safe identity object?

Is it enough just to make the identity object immutable and return a new instance upon each reduce?
Overview: Java 8 introduced Streams as an efficient way of carrying out bulk operations on data, and parallel streams can be obtained in environments that support concurrency; these can come with improved performance, at the cost of multi-threading overhead. Reducing is the repeated process of combining all elements: the reduce operation applies a binary operator to each element in the stream, where the first argument to the operator is the return value of the previous application and the second argument is the current stream element. In this way, Stream.reduce() combines the elements of a stream and produces a single value.
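To make that concrete, here is a minimal sequential sketch (class and variable names are our own):

```java
import java.util.stream.Stream;

public class ReduceBasics {
    public static void main(String[] args) {
        // reduce(identity, accumulator): fold the stream into a single value.
        // Each step combines the running result with the next element.
        int sum = Stream.of(1, 2, 3, 4).reduce(0, (a, b) -> a + b);
        System.out.println(sum); // 10
    }
}
```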
Ordinarily, accumulator is an English word that means: "you are completely hosed if you want parallelism". It's right there in the word: to accumulate, to gather over time. There is no way to do it right except to start from the beginning and apply the accumulation until you are done.
But Java gets around this by adding 2 requirements:

Associativity: a X (b X c) must produce the same result as (a X b) X c, where X is the accumulator function.

Identity: ident X a must be equal to a, where ident is the identity you pass to reduce and X is the accumulator function.

Let's use as an example the function (a, b) -> a + b with the identity 0, which fulfills both of these requirements if your intent is to sum a list.
Java can parallelize this by summing arbitrary terms and then summing the results. [1, 5, 9, 12] can be summed by first splitting the list in two, handing the two sublists to separate threads to sum individually, and then summing the answers each thread provides. This implies that Java will start accumulation multiple times at arbitrary points in the stream, and will apply the identity as part of its accumulation any number of times, at arbitrary points, and that brings swift problems if your identity object is itself mutable.
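A minimal sketch of that splitting behaviour (class and variable names are our own):

```java
import java.util.stream.Stream;

public class ParallelSum {
    public static void main(String[] args) {
        // Sequential: ((((0 + 1) + 5) + 9) + 12).
        int sequential = Stream.of(1, 5, 9, 12).reduce(0, (a, b) -> a + b);

        // Parallel: e.g. (0 + 1 + 5) and (0 + 9 + 12) on separate threads,
        // then 6 + 21. The identity 0 is applied once per sub-range, which is
        // only safe because + is associative and 0 really is an identity.
        int parallel = Stream.of(1, 5, 9, 12).parallel().reduce(0, (a, b) -> a + b);

        System.out.println(sequential); // 27
        System.out.println(parallel);   // 27
    }
}
```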
There's basically no way to combine the notion of a mutable identity object with Java's reduce function. It is fundamentally not designed to work that way.
Contrast this with the sum example: in the (a, b) -> a + b accumulator, neither a nor b is modified; instead, they are combined into a newly created third value, and that is how you should use this method.
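The same combine-into-a-new-value style works for arbitrary objects. Here is a sketch using a hypothetical immutable Stats record (our own example, not from the answer); because the identity instance is never mutated, it can safely be handed to any number of parallel sub-reductions:

```java
import java.util.stream.Stream;

public class ImmutableReduce {
    // A hypothetical immutable pair of (count, sum).
    record Stats(long count, long sum) {
        // Combine two Stats into a brand-new one; neither input is touched.
        Stats merge(Stats other) {
            return new Stats(count + other.count, sum + other.sum);
        }
    }

    public static void main(String[] args) {
        Stats result = Stream.of(1, 5, 9, 12)
                .map(n -> new Stats(1, n))
                // The identity is an immutable "empty" Stats, so reduce may
                // reuse it at arbitrary points without any risk.
                .reduce(new Stats(0, 0), Stats::merge);
        System.out.println(result.count() + " " + result.sum()); // 4 27
    }
}
```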
Contrast this with foldLeft from certain other languages, which requires neither that accumulatorFunction(ident, A) be equal to A nor associativity, but which therefore, by definition, cannot be parallelized at all. That foldLeft can be used with mutable state. For example, here is an implementation of summing using a foldLeft, in pseudocode (note that new int[1] is used here as a mutable integer):
int sum = stream.foldLeft(new int[1], (int[] a, int b) -> a[0] += b)[0];
This notion (where the left-hand side of your accumulator function is always the same thing, namely your identity object, which is modified to integrate each value in the stream as you move along it) is not compatible with Java's reduce. The closest Java gets to this style is Stream.collect(supplier, accumulator, combiner), which deliberately asks for a supplier of fresh mutable containers instead of a single shared identity object.
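In fact, Stream.collect(supplier, accumulator, combiner) is Java's supported counterpart for exactly this mutable style. A sketch, reusing the int[1]-as-mutable-integer trick (class name is our own):

```java
import java.util.stream.Stream;

public class MutableCollect {
    public static void main(String[] args) {
        // Mutable reduction done the supported way: collect takes a
        // *supplier* of containers, so each thread gets its own int[1]
        // instead of all threads sharing one mutable identity object.
        int sum = Stream.of(1, 5, 9, 12)
                .parallel()
                .collect(() -> new int[1],           // fresh container per thread
                         (a, b) -> a[0] += b,        // fold one element in
                         (a, b) -> a[0] += b[0])[0]; // merge two containers
        System.out.println(sum); // 27
    }
}
```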
Thus: it's worse! "Thread-safe" isn't good enough; it needs to be immutable. Once it is immutable, it is trivially thread-safe.
is it enough just to make identity object immutable and return a new instance upon each reduce?
That's not just "good enough"; that's more or less the only sane way to use reduce.
This is covered by the documentation, though not directly; it is implied:

The identity value must be an identity for the accumulator function. This means that for all t, accumulator.apply(identity, t) is equal to t.

As soon as identity is modified, as you describe, even in a thread-safe way, the rule above is violated, and thus there are no guarantees of the expected result.
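A deterministic sketch of how a mutated identity breaks that rule, again using an int[1] as a mutable integer (all names are our own):

```java
import java.util.function.BinaryOperator;

public class BrokenIdentity {
    public static void main(String[] args) {
        // A mutable "identity": a single-element array holding a running sum.
        int[] identity = new int[1];

        // An accumulator that mutates its left argument instead of
        // creating a new value.
        BinaryOperator<int[]> acc = (a, b) -> { a[0] += b[0]; return a; };

        int[] t = {5};
        // First application: identity is silently mutated from 0 to 5.
        System.out.println(acc.apply(identity, t)[0]); // 5 -- looks correct

        // Second application: accumulator.apply(identity, t) should still
        // equal t (5), but the mutated identity now contributes its state.
        System.out.println(acc.apply(identity, t)[0]); // 10 -- rule violated
    }
}
```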
For the second question the answer is slightly more involved. You do not have to make the identity immutable, as long as nobody abuses it by modifying its internal state. Of course, making it immutable helps a lot in that regard.