It seems like when I invoke map on a parallel list, the operation runs in parallel, but when I do filter on that list, the operation runs strictly sequentially. So to make filter parallel, I first do map to (A,Boolean), then filter those tuples, and map all back again. It feels not very convenient.
So I am interested - which operations on parallel collections are parallelized and which are not?
Parallelized collections are created by calling SparkContext 's parallelize method on an existing collection in your driver program (a Scala Seq ). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
The par method on collection provides a very easy high level API to allow computation to run in parallel to take advantage of multi-core processing. When you call the par method on a collection, it will copy all the elements into an equivalent Scala Parallel Collection.
There are no parallel lists. Calling par on a List converts the List into the default parallel immutable sequence - a ParVector. This conversion proceeds sequentially. Both the filter and the map should then be parallel.
scala> import scala.collection._
import scala.collection._
scala> List(1, 2, 3).par.filter { x => println(Thread.currentThread); x > 0 }
Thread[ForkJoinPool-1-worker-5,5,main]
Thread[ForkJoinPool-1-worker-3,5,main]
Thread[ForkJoinPool-1-worker-0,5,main]
res0: scala.collection.parallel.immutable.ParSeq[Int] = ParVector(1, 2, 3)
Perhaps you've concluded that the filter is not parallel, because you've measured both the conversion time and the filter time.
Some operations not parallelized currently: sort* variants, indexOfSlice.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With