Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Behavior of scala fold in parallel collections

Let's run the following line of code several times:

Set(1,2,3,4,5,6,7).par.fold(0)(_ - _)

The results are quite interesting:

scala> Set(1,2,3,4,5,6,7).par.fold(0)(_ - _)
res10: Int = 8
scala> Set(1,2,3,4,5,6,7).par.fold(0)(_ - _)
res11: Int = 20

However clearly it should be like in its sequential version:

scala> Set(1,2,3,4,5,6,7).fold(0)(_ - _)
res15: Int = -28

I understand that operation - is non-associative on integers and that's the reason behind such behavior, but my question is quite simple: doesn't it mean that fold should not be parallelized in .par implementation of collections?

like image 924
tkachuko Avatar asked Aug 03 '16 03:08

tkachuko


2 Answers

When you look at the standard library documentation, you see that fold is undeterministic here:

Folds the elements of this sequence using the specified associative binary operator. The order in which operations are performed on elements is unspecified and may be nondeterministic.

As an alternative, there's foldLeft:

Applies a binary operator to a start value and all elements of this sequence, going left to right. Applies a binary operator to a start value and all elements of this collection or iterator, going left to right.

Note: might return different results for different runs, unless the underlying collection type is ordered or the operator is associative and commutative.

As Set is not an ordered collection, there's no canonical order in which the elements could be folded, so the standard library allows itself to be undeterministic even for foldLeft. If you would use an ordered sequence here, foldLeft would be deterministic in that case.

like image 115
Fabian Schmitthenner Avatar answered Oct 12 '22 13:10

Fabian Schmitthenner


The scaladoc does say:

The order in which the elements are reduced is unspecified and may be nondeterministic.

So, as you stated, a binary operation applied in ParSet#fold that is not associative is not guaranteed to produce a deterministic result. The above text is warning is all you get.

Does that mean ParSet#fold (and cousins) should not be parallelized? Not exactly. If your binary operation is commutative and you don't care about non-determinism of side-effects (not that a fold should have any), then there isn't a problem. However, you are hit with the caveat of needing to tread carefully around parallel collections.

Whether or not it is correct is more of a matter of opinion. One could argue that if a method can result in accidental non-determinism, that it should not exist in a language or library. But the alternative is to clip out functionality so that ParSet is missing functionality that is present in most of the other collection implementations. You could use the same line of thinking to also suggest the removal of Stream#foreach to prevent people from accidentally triggering infinite loops on infinite streams, but should you?

like image 44
Michael Zajac Avatar answered Oct 12 '22 12:10

Michael Zajac