It seems that when I invoke map on a parallel list, the operation runs in parallel, but when I invoke filter on that list, it runs strictly sequentially. So to make filter parallel, I first map to (A, Boolean), then filter those tuples, and map everything back again. That is not very convenient.
So I am interested: which operations on parallel collections are parallelized, and which are not?
Parallelized collections are created by calling SparkContext's parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
The par method on a collection provides a simple, high-level API for running computations in parallel and taking advantage of multi-core processors. When you call par on a collection, you get an equivalent Scala parallel collection. For inherently sequential collections such as List, this means copying all the elements; collections such as Vector, Array and Range can be converted without a full copy.
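As a rough illustration of the par API described above, here is a minimal sketch. It assumes Scala 2.12 or earlier, where par is available on the standard collections; on Scala 2.13 and later, parallel collections live in the separate scala-parallel-collections module and need `import scala.collection.parallel.CollectionConverters._`. The object and method names (`ParDemo`, `doubledTotal`, `evenCount`) are made up for this example.

```scala
// Sketch only: assumes Scala 2.12 or earlier (par is built in).
// On 2.13+, add the scala-parallel-collections module and
// import scala.collection.parallel.CollectionConverters._
object ParDemo {
  // A plain sequential Vector holding 1..1000.
  val numbers: Vector[Int] = (1 to 1000).toVector

  // par turns the Vector into a parallel collection; map and sum
  // are then free to split the work across worker threads.
  def doubledTotal: Int = numbers.par.map(_ * 2).sum

  // filter is used exactly the same way as map.
  def evenCount: Int = numbers.par.filter(_ % 2 == 0).size
}
```

The parallel work is kept inside methods rather than in the object's initializer, since running parallel operations while an object is still being initialized can deadlock.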
There are no parallel lists. Calling par on a List converts the List into the default parallel immutable sequence, a ParVector. This conversion proceeds sequentially. Both the filter and the map should then be parallel.
scala> import scala.collection._
import scala.collection._
scala> List(1, 2, 3).par.filter { x => println(Thread.currentThread); x > 0 }
Thread[ForkJoinPool-1-worker-5,5,main]
Thread[ForkJoinPool-1-worker-3,5,main]
Thread[ForkJoinPool-1-worker-0,5,main]
res0: scala.collection.parallel.immutable.ParSeq[Int] = ParVector(1, 2, 3)
Perhaps you've concluded that the filter is not parallel because you measured the conversion time together with the filter time.
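One way to check this is to time the conversion and the filter separately. The following is only a sketch: it assumes Scala 2.12 or earlier (on 2.13+ you would need the scala-parallel-collections module and `import scala.collection.parallel.CollectionConverters._`), the names `TimingSketch`, `timed` and `run` are hypothetical, and a single System.nanoTime measurement without JVM warm-up gives a rough indication rather than a reliable benchmark.

```scala
// Sketch only: assumes Scala 2.12 or earlier (par is built in).
object TimingSketch {
  // Hypothetical helper: evaluates body once and returns its
  // result together with the elapsed time in nanoseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, System.nanoTime() - start)
  }

  // Returns (number of even elements, conversion time, filter time).
  def run(size: Int): (Int, Long, Long) = {
    val list = List.range(0, size)

    // Step 1: the List -> ParVector conversion, measured alone.
    val (parallel, conversionNanos) = timed(list.par)

    // Step 2: the filter on the already converted collection.
    val (kept, filterNanos) = timed(parallel.filter(_ % 2 == 0))

    (kept.size, conversionNanos, filterNanos)
  }

  def main(args: Array[String]): Unit = {
    val (count, conversionNanos, filterNanos) = run(1000000)
    println(s"kept $count elements")
    println(s"conversion: ${conversionNanos / 1000} us, filter: ${filterNanos / 1000} us")
  }
}
```

If the conversion accounts for most of the total, converting once and reusing the parallel collection for several operations avoids paying that cost repeatedly.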
Some operations are currently not parallelized: the sort* variants and indexOfSlice.