So scala 2.9 recently turned up in Debian testing, bringing the newfangled parallel collections with it.
Suppose I have some code equivalent to
def expensiveFunction(x:Int):Int = {...}
def process(s:List[Int]):List[Int} = s.map(expensiveFunction)
now from the teeny bit I'd gleaned about parallel collections before the docs actually turned up on my machine, I was expecting to parallelize this just by switching the List to a ParList
... but to my surprise, there isn't one! (Just ParVector
, ParMap
, ParSet
...).
As a workround, this (or a one-line equivalent) seems to work well enough:
def process(s:List[Int]):List[Int} = {
val ps=scala.collection.parallel.immutable.ParVector()++s
val pr=ps.map(expensiveFunction)
List()++pr
}
yielding an approximately x3 performance improvement in my test code and achieving massively higher CPU usage (quad core plus hyperthreading i7). But it seems kind of clunky.
My question is a sort of an aggregated:
ParList
?ParList
, is there a
better pattern/idiom I should adopt so that
I don't feel like they're missing ?Vectors
? (I mean in C++ land
I'd generally need a pretty good reason to use
std::list
over std::vector
).List
s are great when you want pattern matching (i.e. case x :: xs
) and for efficient prepending/iteration. However, they are not so great when you want fast access-by-index, or splitting into chunks, or joining (i.e. xs ::: ys
).
Hence it does not make much sense (to have a parallel List
) when you think that this kind of thing (splitting and joining) is exactly what is needed for efficient parallelism. Use:
xs.toIndexedSeq.par
First, let me show you how to make a parallel version of that code:
def expensiveFunction(x:Int):Int = {...}
def process(s:List[Int]):Seq[Int] = s.par.map(expensiveFunction).seq
That will have Scala figure things out for you -- and, by the way, it uses ParVector. If you really want List
, call .toList
instead of .seq
.
As for the questions:
There isn't a ParList
because a List
is an intrinsically non-parallel data structure, because any operation on it requires traversal.
You should code to traits instead of classes -- Seq
, ParSeq
and GenSeq
, for example. Even performance characteristics of a List
are guaranteed by LinearSeq
.
All the books before Scala 2.8 did not have the new collections library in mind. In particular, the collections really didn't share a consistent and complete API. Now they do, and you'll gain much by taking advantage of it.
Furthermore, there wasn't a collection like Vector
in Scala 2.7 -- an immutable collection with (near) constant indexed access.
A List
cannot be easily split into various sub-lists which makes it hard to parallelise. For one, it has O(n) access; also a List
cannot strip its tail, so one need to include a length parameter.
I guess, taking a Vector
will be the better solution.
Note that Scala’s Vector
is different from std::vector
. The latter is basically a wrapper around standard array, a contiguous block in memory which needs to be copied every now and then when adding or removing data. Scala’s Vector
is a specialised data structure which allows for efficient copying and splitting while keeping the data itself immutable.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With