Does it make sense to use parallel collections on Spark?
All the Spark examples I have seen so far use RDDs of very simple data types (single classes and tuples). But in fact collections, and specifically parallel collections, can also be used as elements of an RDD.
A worker may have several cores available for execution, and if a regular (sequential) collection is used as the RDD element, those extra cores will stay idle.
Here is a test I ran with the local master:
import org.apache.spark.{SparkConf, SparkContext}

val conf: SparkConf = new SparkConf().setAppName("myApp").setMaster("local[2]")
val sc = new SparkContext(conf)

// Build (n, List(1..n)) pairs, then turn each inner list into a parallel array
val l = List(1, 2, 3, 4, 5, 6, 7, 8)
val l1 = l.map(item => (item, (1 to item).toList))
val l2 = l1.map(item => (item._1, item._2.toParArray))
val l3 = sc.parallelize(l2)

// Each task maps over its parallel array; printing the thread name shows how many threads do the work
l3.sortBy(_._1).foreach(t => t._2.map(x => { println(t._1 + " " + Thread.currentThread.getName); x / 2 }))
In this case, when I use a ParArray I see 16 threads working, while with a plain Array only 2 threads work. This can be read as 2 workers each having 8 threads available.
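A more direct way to count the threads involved (my own sketch, not part of the original test) is to collect the thread names back to the driver instead of eyeballing the console output:

// Record the thread name used for each inner element, then count distinct names on the driver
val threadNames = l3.flatMap(t => t._2.map(_ => Thread.currentThread.getName).toList).distinct.collect()
println(s"Distinct threads used: ${threadNames.length}")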
On the other hand, any logic expressed with a parallel collection could instead be rewritten as RDD transformations over simple types, for example by flattening the nested collections as sketched below.
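For illustration, here is a minimal sketch of that alternative (my own example, not from the original test): the inner elements are flattened into the RDD itself, so the parallelism comes from Spark's scheduler rather than a per-element thread pool.

// Same data, but each inner element becomes its own RDD record
val flat = sc.parallelize(l1)                      // RDD[(Int, List[Int])]
  .flatMap { case (k, xs) => xs.map(x => (k, x)) } // RDD[(Int, Int)]

// The per-element work is now an ordinary transformation that Spark parallelizes
val halved = flat.mapValues(_ / 2)

// Regroup if the original nesting is still needed
val regrouped = halved.groupByKey()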
Is using those parallel collections encouraged and considered good practice?
Is using those parallel collections encouraged and considered good practice?
Unlikely. Consider the following facts:
You can use spark.task.cpus to explicitly ask for a specific number of threads within a task, but it is a global setting and cannot be adjusted depending on the context, so you effectively block resources whether you use them or not.

Finally, let's quote Reynold Xin:
Parallel collection is fairly complicated and difficult to manage (implicit thread pools). It is good for the more basic thread management, but Spark itself has much more sophisticated parallelization built in.
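For reference, a minimal sketch of the spark.task.cpus setting mentioned above (the value 4 is just an illustrative assumption):

// Reserve 4 cores per task cluster-wide; every task pays this cost,
// whether or not it actually spawns extra threads
val conf = new SparkConf()
  .setAppName("myApp")
  .set("spark.task.cpus", "4")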