Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How did Apache Spark implement its topK() API?

Tags:

apache-spark

In Apache Spark there is an RDD.top() API, which can return the top k elements from an RDD. I'd like to know how this operations is implemented. Does it first sort the RDD then return the top k values? Or does it use some other more efficient implementation?

like image 471
xuanyue Avatar asked Mar 16 '23 05:03

xuanyue


1 Answers

No, it doesn't sort the whole RDD, that operation would be too expensive.

It will rather select TOP N elements per each partition separately using a priority queue. And then these queues are merged together in the reduce operation. That means only small part of the whole RDD is shuffled across the network.

See RDD.scala for more details.

Example:

3 input partitions
RDD.top(2)

[3, 5, 7, 10], [8, 6, 4, 12], [9, 1, 2, 11]
      ||            ||              || 
   [10, 7]        [12, 8]         [11, 9]
================== reduce ==================
                 [12, 11]
like image 132
vanekjar Avatar answered Mar 28 '23 14:03

vanekjar