Is using parallel collections encouraged in Spark

Does it make sense to use parallel collections on Spark?

All the Spark examples I have seen so far use RDDs of very simple data types (single classes and tuples). But in fact collections, and specifically parallel collections, can be used as elements of an RDD.

A worker may have several cores available for execution, and if a regular collection is used as an RDD element, those extra cores will stay idle.

Here is a test I ran with the local master:

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.parallel._ // for the toParArray conversion

// local master with 2 worker threads
val conf: SparkConf = new SparkConf().setAppName("myApp").setMaster("local[2]")
val sc = new SparkContext(conf)

// pair each number with the list 1..number, then make the inner list parallel
val l = List(1, 2, 3, 4, 5, 6, 7, 8)
val l1 = l.map(item => (item, (1 to item).toList))
val l2 = l1.map(item => (item._1, item._2.toParArray))
val l3 = sc.parallelize(l2)

// print which thread processes each element of the inner collection
l3.sortBy(_._1).foreach(t => t._2.map { x => println(t._1 + " " + Thread.currentThread.getName); x / 2 })

In this case, when I use a ParArray I see 16 threads working, and when I use a simple Array only 2 threads work. This can be seen as 2 workers each having 8 threads available.

On the other hand, any logic expressed with parallel collections can be rewritten as RDD transformations over simple types.
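For illustration, a minimal sketch of such a rewrite (the (item, x) pair encoding is an assumption, one of several possible layouts; sc is the context created above):

// the nested-collection example above, flattened into plain RDD rows:
// each inner element becomes its own record, so Spark schedules the
// work across tasks instead of an in-task thread pool
val flat = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8))
  .flatMap(item => (1 to item).map(x => (item, x)))
val halved = flat.mapValues { x => println(Thread.currentThread.getName); x / 2 }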

Is using those parallel collections encouraged and considered good practice?

asked by antonpuz
1 Answer

Is using those parallel collections encouraged and considered good practice?

Unlikely. Consider the following facts:

  • Any parallel execution inside a task is completely opaque to the resource manager, and as a result it cannot automatically allocate the required resources.
  • You can use spark.task.cpus to explicitly ask for a specific number of threads within a task, but it is a global setting and cannot be adjusted depending on the context, so you effectively block resources whether you use them or not.
  • If thread underutilization is a valid concern, you can always increase the number of partitions (see the sketch after this list).
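For illustration, a minimal sketch of those two knobs (spark.task.cpus is a real Spark setting; the names tunedConf and wellPartitioned and the values 4 and 16 are arbitrary examples):

// spark.task.cpus is global: every task reserves this many cores,
// whether it actually uses them or not
val tunedConf = new SparkConf()
  .setAppName("myApp")
  .setMaster("local[8]")
  .set("spark.task.cpus", "4")

// the usual alternative: raise the partition count so Spark itself
// runs more tasks in parallel
val wellPartitioned = sc.parallelize(1 to 100, numSlices = 16)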

Finally, let's quote Reynold Xin:

Parallel collection is fairly complicated and difficult to manage (implicit thread pools). It is good for more basic thread management, but Spark itself has much more sophisticated parallelization built-in.

answered by zero323

