How to get top N elements from an Apache Spark RDD for large N

I have an RDD[(Int, Double)] (where Int is unique) with around 400 million entries and need to get top N. rdd.top(N)(Ordering.by(_._2)) works great for small N (tested up to 100,000), but when I need the top 1 million, I run into this error:

Total size of serialized results of 5634 tasks (1024.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I understand why the error happens (though it is surprising that serializing a single (Int, Double) pair apparently takes around 1 KB), and I also understand that I can work around it by increasing spark.driver.maxResultSize. However, that only works up to a certain N, and I cannot know whether a given setting is enough until the whole job crashes.

How can I get the top N entries efficiently without using top or takeOrdered, since they both return Arrays that can get too big for a large N?

Scala solutions are preferred.

asked by nedim

1 Answer

So there are a few solutions to this. The simplest is enabling Kryo serialization, which will likely reduce the amount of memory required for the serialized results.
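A minimal sketch of that, assuming you build the SparkConf yourself (the same properties can also be passed with spark-submit --conf):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Switch the serializer to Kryo so the serialized task results that count against
// spark.driver.maxResultSize are smaller than with Java serialization.
val conf = new SparkConf()
  .setAppName("top-n-example") // illustrative app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional: registering the classes involved lets Kryo write compact class identifiers.
  .registerKryoClasses(Array(classOf[Tuple2[_, _]]))

val sc = new SparkContext(conf)
```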

Another would be using sortByKey followed by mapPartitionsWithIndex to get the count of each partition, figuring out which partitions you need to keep, and then working with the resulting RDD (this one is better if you are OK with expressing the rest of your operations on RDDs); a sketch follows below.
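One possible sketch of that idea. Since the ordering you want is by the Double value rather than the Int key, this version swaps the pair before sorting; topNAsRdd is just an illustrative name, not an existing API:

```scala
import org.apache.spark.rdd.RDD

def topNAsRdd(rdd: RDD[(Int, Double)], n: Int): RDD[(Int, Double)] = {
  // Sort descending by value; after the swap the Double is the key.
  val sorted = rdd.map(_.swap).sortByKey(ascending = false)

  // Count the rows in each partition; only this small array of counts
  // is collected to the driver, regardless of how big n is.
  val countsPerPartition: Array[(Int, Long)] = sorted
    .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size.toLong)) }
    .collect()
    .sortBy(_._1)

  // cumulativeBefore(i) = number of rows in partitions 0 .. i-1 of the sorted RDD.
  val cumulativeBefore: Array[Long] = countsPerPartition.map(_._2).scanLeft(0L)(_ + _)

  // Keep partitions that fall entirely inside the top n, take a prefix of the
  // boundary partition, and drop everything after it.
  sorted
    .mapPartitionsWithIndex { (idx, iter) =>
      val stillNeeded = n - cumulativeBefore(idx)
      if (stillNeeded <= 0) Iterator.empty else iter.take(stillNeeded.toInt)
    }
    .map(_.swap) // back to (Int, Double)
}
```

Note that this runs the sorted RDD through two actions (one for the counts, one for the final pass), so caching sorted before the first action is worth considering.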

If you need the top N locally on the driver, you could sortByKey, cache the resulting RDD, and then use toLocalIterator.
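A minimal sketch of that approach (again swapping so the sort is by value; topNLocally is an illustrative name):

```scala
import org.apache.spark.rdd.RDD

def topNLocally(rdd: RDD[(Int, Double)], n: Int): Iterator[(Int, Double)] = {
  val sorted = rdd.map(_.swap).sortByKey(ascending = false).map(_.swap)
  sorted.cache() // avoid re-running the sort as toLocalIterator walks the partitions
  sorted.toLocalIterator.take(n)
}
```

Because toLocalIterator fetches one partition at a time, the driver only needs to hold the largest partition in memory rather than a single Array of all N results.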

Hope that one of these three approaches meets your needs.

answered by Holden

