Does it make sense to use parallel collections on Spark?
All the Spark examples I have seen so far use RDDs of very simple data types (single classes and tuples). But in fact collections, and specifically parallel collections, can also be used as elements of an RDD.
A worker may have several cores available for execution, and if a regular (sequential) collection is used as the RDD element, those extra cores will stay idle.
Here is a test I ran with the local master:
import org.apache.spark.{SparkConf, SparkContext}

val conf: SparkConf = new SparkConf().setAppName("myApp").setMaster("local[2]")
val sc = new SparkContext(conf)

// Build (n, List(1..n)) pairs, then turn each inner list into a parallel array
val l = List(1, 2, 3, 4, 5, 6, 7, 8)
val l1 = l.map(item => (item, (1 to item).toList))
val l2 = l1.map(item => (item._1, item._2.toParArray))
val l3 = sc.parallelize(l2)

// Each task maps over its parallel array; printing the thread name shows how many threads do the work
l3.sortBy(_._1).foreach(t => t._2.map(x => { println(t._1 + " " + Thread.currentThread.getName); x / 2 }))
In this case, when I use a ParArray I see 16 threads working, while with a plain Array only 2 threads work. This can be read as 2 workers each having 8 threads available.
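A more direct way to count the threads involved (my own sketch, not part of the original test) is to collect the thread names back to the driver instead of eyeballing the console output:

// Record the thread name used for each inner element, then count distinct names on the driver
val threadNames = l3.flatMap(t => t._2.map(_ => Thread.currentThread.getName).toList).distinct.collect()
println(s"Distinct threads used: ${threadNames.length}")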
On the other hand, any logic expressed with a parallel collection could instead be rewritten as RDD transformations over simple types, for example by flattening the nested collections as sketched below.
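For illustration, here is a minimal sketch of that alternative (my own example, not from the original test): the inner elements are flattened into the RDD itself, so the parallelism comes from Spark's scheduler rather than a per-element thread pool.

// Same data, but each inner element becomes its own RDD record
val flat = sc.parallelize(l1)                      // RDD[(Int, List[Int])]
  .flatMap { case (k, xs) => xs.map(x => (k, x)) } // RDD[(Int, Int)]

// The per-element work is now an ordinary transformation that Spark parallelizes
val halved = flat.mapValues(_ / 2)

// Regroup if the original nesting is still needed
val regrouped = halved.groupByKey()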
Is using those parallel collections encouraged and considered good practice?
Is using those parallel collections encouraged and considered good practice?
Unlikely. Consider the following facts:
You can use spark.task.cpus to explicitly ask for a specific number of threads within a task, but it is a global setting and cannot be adjusted depending on the context, so you effectively block resources whether you use them or not.

Finally, let's quote Reynold Xin:
Parallel collection is fairly complicated and difficult to manage (implicit thread pools). It is good for the more basic thread management, but Spark itself has much more sophisticated parallelization built in.
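For reference, a minimal sketch of the spark.task.cpus setting mentioned above (the value 4 is just an illustrative assumption):

// Reserve 4 cores per task cluster-wide; every task pays this cost,
// whether or not it actually spawns extra threads
val conf = new SparkConf()
  .setAppName("myApp")
  .set("spark.task.cpus", "4")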