How did Apache Spark implement its topK() API?

Question

In Apache Spark there is an RDD.top() API, which can return the top k elements from an RDD. I'd like to know how this operations is implemented. Does it first sort the RDD then return the top k values? Or does it use some other more efficient implementation?

vanekjar · Accepted Answer

No, it doesn't sort the whole RDD, that operation would be too expensive.

It will rather select TOP N elements per each partition separately using a priority queue. And then these queues are merged together in the reduce operation. That means only small part of the whole RDD is shuffled across the network.

See RDD.scala for more details.

Example:

3 input partitions
RDD.top(2)

[3, 5, 7, 10], [8, 6, 4, 12], [9, 1, 2, 11]
      ||            ||              || 
   [10, 7]        [12, 8]         [11, 9]
================== reduce ==================
                 [12, 11]

How did Apache Spark implement its topK() API?

Tags:

apache-spark

xuanyue

1 Answers

vanekjar

Recent Activity

Donate For Us

How did Apache Spark implement its topK() API?

Tags:

apache-spark

xuanyue

1 Answers

vanekjar

Related questions

Recent Activity

Donate For Us