How to get top N elements from an Apache Spark RDD for large N

I have an RDD[(Int, Double)] (where Int is unique) with around 400 million entries and need to get top N. rdd.top(N)(Ordering.by(_._2)) works great for small N (tested up to 100,000), but when I need the top 1 million, I run into this error:

Total size of serialized results of 5634 tasks (1024.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I understand why the error happens (though it is surprising that serializing a single (Int, Double) pair apparently takes around 1 KB), and I also understand that I can work around it by increasing spark.driver.maxResultSize. However, that only works up to a certain N, and I cannot know whether a given setting is enough until the whole job crashes.

How can I get the top N entries efficiently without using top or takeOrdered, since they both return Arrays that can get too big for a large N?

Scala solutions are preferred.

asked by nedim

1 Answer

So there are a few solutions to this. The simplest is enabling Kryo serialization, which will likely reduce the amount of memory required for the serialized results.
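A minimal sketch of that, assuming you build the SparkConf yourself (the same properties can also be passed with spark-submit --conf):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Switch the serializer to Kryo so the serialized task results that count against
// spark.driver.maxResultSize are smaller than with Java serialization.
val conf = new SparkConf()
  .setAppName("top-n-example") // illustrative app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional: registering the classes involved lets Kryo write compact class identifiers.
  .registerKryoClasses(Array(classOf[Tuple2[_, _]]))

val sc = new SparkContext(conf)
```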

Another would be using sortByKey followed by mapPartitionsWithIndex to get the count of each partition, figuring out which partitions you need to keep, and then working with the resulting RDD (this one is better if you are OK with expressing the rest of your operations on RDDs); a sketch follows below.
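One possible sketch of that idea. Since the ordering you want is by the Double value rather than the Int key, this version swaps the pair before sorting; topNAsRdd is just an illustrative name, not an existing API:

```scala
import org.apache.spark.rdd.RDD

def topNAsRdd(rdd: RDD[(Int, Double)], n: Int): RDD[(Int, Double)] = {
  // Sort descending by value; after the swap the Double is the key.
  val sorted = rdd.map(_.swap).sortByKey(ascending = false)

  // Count the rows in each partition; only this small array of counts
  // is collected to the driver, regardless of how big n is.
  val countsPerPartition: Array[(Int, Long)] = sorted
    .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size.toLong)) }
    .collect()
    .sortBy(_._1)

  // cumulativeBefore(i) = number of rows in partitions 0 .. i-1 of the sorted RDD.
  val cumulativeBefore: Array[Long] = countsPerPartition.map(_._2).scanLeft(0L)(_ + _)

  // Keep partitions that fall entirely inside the top n, take a prefix of the
  // boundary partition, and drop everything after it.
  sorted
    .mapPartitionsWithIndex { (idx, iter) =>
      val stillNeeded = n - cumulativeBefore(idx)
      if (stillNeeded <= 0) Iterator.empty else iter.take(stillNeeded.toInt)
    }
    .map(_.swap) // back to (Int, Double)
}
```

Note that this runs the sorted RDD through two actions (one for the counts, one for the final pass), so caching sorted before the first action is worth considering.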

If you need the top N locally on the driver, you could sortByKey, cache the resulting RDD, and then use toLocalIterator.
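A minimal sketch of that approach (again swapping so the sort is by value; topNLocally is an illustrative name):

```scala
import org.apache.spark.rdd.RDD

def topNLocally(rdd: RDD[(Int, Double)], n: Int): Iterator[(Int, Double)] = {
  val sorted = rdd.map(_.swap).sortByKey(ascending = false).map(_.swap)
  sorted.cache() // avoid re-running the sort as toLocalIterator walks the partitions
  sorted.toLocalIterator.take(n)
}
```

Because toLocalIterator fetches one partition at a time, the driver only needs to hold the largest partition in memory rather than a single Array of all N results.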

Hope that one of these three approaches meets your needs.

answered by Holden

