In Apache Spark there is an RDD.top()
API, which can return the top k elements from an RDD. I'd like to know how this operations is implemented. Does it first sort the RDD then return the top k values? Or does it use some other more efficient implementation?
No, it doesn't sort the whole RDD, that operation would be too expensive.
It will rather select TOP N elements per each partition separately using a priority queue. And then these queues are merged together in the reduce operation. That means only small part of the whole RDD is shuffled across the network.
See RDD.scala
for more details.
Example:
3 input partitionsRDD.top(2)
[3, 5, 7, 10], [8, 6, 4, 12], [9, 1, 2, 11]
|| || ||
[10, 7] [12, 8] [11, 9]
================== reduce ==================
[12, 11]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With