Reading Apache Spark guide at http://spark.apache.org/docs/latest/programming-guide.html it states :
Why is take function not run in parallel? What are the difficulties in implementing this type of function in parallel ? Is it something to do with fact that in order to take first n elements of RDD it is required to traverse entire RDD ?
Actually, while take
is not entirely parallel, it's not entirely sequential either.
For example let's say you take(200)
, and each partition has 10 elements. take
will first fetch partition 0 and see that it has 10 elements. It assumes that it would need 20 such partitions to get 200 elements. But it's better to ask for a bit more in a parallel request. So it wants 30 partitions, and it already has 1. So it fetches partitions 1 to 29 next, in parallel. This will likely be the last step. If it's very unlucky, and does not find a total of 200 elements, it will again make an estimate and request another batch in parallel.
Check out the code, it's well documented: https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1049
I think the documentation is wrong. Local calculation only happens when a single partition is required. This is the case in the first pass (fetching partition 0), but typically not the case in later passes.
How would you implement it in parallel? Let's say you have 4 partitions and want to take first 5 elements. If you knew in advance the size of each partition, it would be easy: for example, if each partition has 3 elements driver asks partition 0 for all elements and it asks partition 1 for 2 elements. So the problem is that it isn't known how many elements each partition has.
Now, you could first calculate partition sizes, but this requires limiting the set of RDD transformations supported, calculating elements more than once, or some other tradeoff, and will generally need more communication overhead.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With