Why Apache Spark take function not parallel?

Question

Reading Apache Spark guide at http://spark.apache.org/docs/latest/programming-guide.html it states :

enter image description here

Why is take function not run in parallel? What are the difficulties in implementing this type of function in parallel ? Is it something to do with fact that in order to take first n elements of RDD it is required to traverse entire RDD ?

Daniel Darabos · Accepted Answer

Actually, while take is not entirely parallel, it's not entirely sequential either.

For example let's say you take(200), and each partition has 10 elements. take will first fetch partition 0 and see that it has 10 elements. It assumes that it would need 20 such partitions to get 200 elements. But it's better to ask for a bit more in a parallel request. So it wants 30 partitions, and it already has 1. So it fetches partitions 1 to 29 next, in parallel. This will likely be the last step. If it's very unlucky, and does not find a total of 200 elements, it will again make an estimate and request another batch in parallel.

Check out the code, it's well documented: https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1049

I think the documentation is wrong. Local calculation only happens when a single partition is required. This is the case in the first pass (fetching partition 0), but typically not the case in later passes.

Alexey Romanov · Answer

How would you implement it in parallel? Let's say you have 4 partitions and want to take first 5 elements. If you knew in advance the size of each partition, it would be easy: for example, if each partition has 3 elements driver asks partition 0 for all elements and it asks partition 1 for 2 elements. So the problem is that it isn't known how many elements each partition has.

Now, you could first calculate partition sizes, but this requires limiting the set of RDD transformations supported, calculating elements more than once, or some other tradeoff, and will generally need more communication overhead.

Why Apache Spark take function not parallel?

Tags:

parallel-processing

scala

apache-spark

blue-sky

2 Answers

Daniel Darabos

Alexey Romanov

Recent Activity

Donate For Us

Why Apache Spark take function not parallel?

Tags:

parallel-processing

scala

apache-spark

blue-sky

2 Answers

Daniel Darabos

Alexey Romanov

Related questions

Recent Activity

Donate For Us