Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why Apache Spark take function not parallel?

Reading Apache Spark guide at http://spark.apache.org/docs/latest/programming-guide.html it states :

enter image description here

Why is take function not run in parallel? What are the difficulties in implementing this type of function in parallel ? Is it something to do with fact that in order to take first n elements of RDD it is required to traverse entire RDD ?

like image 332
blue-sky Avatar asked Dec 20 '22 07:12

blue-sky


2 Answers

Actually, while take is not entirely parallel, it's not entirely sequential either.

For example let's say you take(200), and each partition has 10 elements. take will first fetch partition 0 and see that it has 10 elements. It assumes that it would need 20 such partitions to get 200 elements. But it's better to ask for a bit more in a parallel request. So it wants 30 partitions, and it already has 1. So it fetches partitions 1 to 29 next, in parallel. This will likely be the last step. If it's very unlucky, and does not find a total of 200 elements, it will again make an estimate and request another batch in parallel.

Check out the code, it's well documented: https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1049

I think the documentation is wrong. Local calculation only happens when a single partition is required. This is the case in the first pass (fetching partition 0), but typically not the case in later passes.

like image 187
Daniel Darabos Avatar answered Jan 15 '23 23:01

Daniel Darabos


How would you implement it in parallel? Let's say you have 4 partitions and want to take first 5 elements. If you knew in advance the size of each partition, it would be easy: for example, if each partition has 3 elements driver asks partition 0 for all elements and it asks partition 1 for 2 elements. So the problem is that it isn't known how many elements each partition has.

Now, you could first calculate partition sizes, but this requires limiting the set of RDD transformations supported, calculating elements more than once, or some other tradeoff, and will generally need more communication overhead.

like image 34
Alexey Romanov Avatar answered Jan 16 '23 00:01

Alexey Romanov