Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dealing with the surprising lack of ParList in scala.collections.parallel

So scala 2.9 recently turned up in Debian testing, bringing the newfangled parallel collections with it.

Suppose I have some code equivalent to

  def expensiveFunction(x:Int):Int = {...}

  def process(s:List[Int]):List[Int} = s.map(expensiveFunction)

now from the teeny bit I'd gleaned about parallel collections before the docs actually turned up on my machine, I was expecting to parallelize this just by switching the List to a ParList... but to my surprise, there isn't one! (Just ParVector, ParMap, ParSet...).

As a workround, this (or a one-line equivalent) seems to work well enough:

  def process(s:List[Int]):List[Int} = {
    val ps=scala.collection.parallel.immutable.ParVector()++s
    val pr=ps.map(expensiveFunction)
    List()++pr
  }

yielding an approximately x3 performance improvement in my test code and achieving massively higher CPU usage (quad core plus hyperthreading i7). But it seems kind of clunky.

My question is a sort of an aggregated:

  • Why isn't there a ParList ?
  • Given there isn't a ParList, is there a better pattern/idiom I should adopt so that I don't feel like they're missing ?
  • Am I just "behind the times" using Lists a lot in my scala programs (like all the Scala books I bought back in the 2.7 days taught me) and I should actually be making more use of Vectors ? (I mean in C++ land I'd generally need a pretty good reason to use std::list over std::vector).
like image 849
timday Avatar asked Jul 10 '11 16:07

timday


3 Answers

Lists are great when you want pattern matching (i.e. case x :: xs) and for efficient prepending/iteration. However, they are not so great when you want fast access-by-index, or splitting into chunks, or joining (i.e. xs ::: ys).

Hence it does not make much sense (to have a parallel List) when you think that this kind of thing (splitting and joining) is exactly what is needed for efficient parallelism. Use:

xs.toIndexedSeq.par
like image 79
oxbow_lakes Avatar answered Nov 08 '22 07:11

oxbow_lakes


First, let me show you how to make a parallel version of that code:

def expensiveFunction(x:Int):Int = {...}

def process(s:List[Int]):Seq[Int] = s.par.map(expensiveFunction).seq

That will have Scala figure things out for you -- and, by the way, it uses ParVector. If you really want List, call .toList instead of .seq.

As for the questions:

  • There isn't a ParList because a List is an intrinsically non-parallel data structure, because any operation on it requires traversal.

  • You should code to traits instead of classes -- Seq, ParSeq and GenSeq, for example. Even performance characteristics of a List are guaranteed by LinearSeq.

  • All the books before Scala 2.8 did not have the new collections library in mind. In particular, the collections really didn't share a consistent and complete API. Now they do, and you'll gain much by taking advantage of it.

    Furthermore, there wasn't a collection like Vector in Scala 2.7 -- an immutable collection with (near) constant indexed access.

like image 43
Daniel C. Sobral Avatar answered Nov 08 '22 09:11

Daniel C. Sobral


A List cannot be easily split into various sub-lists which makes it hard to parallelise. For one, it has O(n) access; also a List cannot strip its tail, so one need to include a length parameter.

I guess, taking a Vector will be the better solution.

Note that Scala’s Vector is different from std::vector. The latter is basically a wrapper around standard array, a contiguous block in memory which needs to be copied every now and then when adding or removing data. Scala’s Vector is a specialised data structure which allows for efficient copying and splitting while keeping the data itself immutable.

like image 7
Debilski Avatar answered Nov 08 '22 08:11

Debilski