 

How do I select a range of elements in Spark RDD?


I'd like to select a range of elements in a Spark RDD. For example, I have an RDD with a hundred elements, and I need to select elements from 60 to 80. How do I do that?

I see that RDD has a take(i: Int) method, which returns the first i elements, but there is no corresponding method to take the last i elements, or i elements from the middle starting at a given index.

asked Jul 10 '14 by PlinyTheElder

People also ask

Which function returns the number of elements in an RDD?

The action count() returns the number of elements in an RDD. For example, if an RDD has the values {1, 2, 2, 3, 4, 5, 5, 6}, then rdd.count() will return 8.
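For illustration, a minimal snippet along those lines (assuming a SparkContext named sc is already in scope):

// Assumes an existing SparkContext `sc`.
val rdd = sc.parallelize(Seq(1, 2, 2, 3, 4, 5, 5, 6))
println(rdd.count()) // prints 8; duplicates are counted as separate elements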

What does take () do in Spark?

take(num) returns the first num elements of the RDD. It works by first scanning one partition and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
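A small sketch of that behavior (again assuming a SparkContext named sc):

// Assumes an existing SparkContext `sc`.
val rdd = sc.parallelize(1 to 100)
println(rdd.take(5).mkString(", ")) // prints 1, 2, 3, 4, 5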


1 Answer

I don't think there is an efficient method to do this yet, but the easy way is to use filter(). Let's say you have an RDD called pairs of key-value pairs and you only want the elements with keys from 60 to 80 inclusive; just do:

// keep only the pairs whose keys fall between 60 and 80 (inclusive)
val range60to80 = pairs.filter {
  case (k, v) => k >= 60 && k <= 80
}

I think it's possible that this could be done more efficiently in the future, by using sortByKey and saving information about the range of values mapped to each partition. Keep in mind this approach would only save anything if you were planning to query the range multiple times because the sort is obviously expensive.

From looking at the Spark source, it would definitely be possible to do efficient range queries using RangePartitioner:

// An array of upper bounds for the first (partitions - 1) partitions
private val rangeBounds: Array[K] = {

This is a private member of RangePartitioner that knows the upper bounds of all the partitions; with it, it would be easy to query only the necessary partitions. It looks like this is something Spark users may see in the future: SPARK-911

UPDATE: A much better answer, based on the pull request I'm writing for SPARK-911. It will run efficiently if the RDD is sorted and you query it multiple times.

import org.apache.spark.RangePartitioner

// Sort the RDD by key so it is range-partitioned, and cache it for repeated queries.
val sorted = sc.parallelize((1 to 100).map(x => (x, x))).sortByKey().cache()
val p: RangePartitioner[Int, Int] = sorted.partitioner.get.asInstanceOf[RangePartitioner[Int, Int]]
val (lower, upper) = (10, 20)

// Only the partitions whose key ranges overlap [lower, upper] need to be scanned.
val range = p.getPartition(lower) to p.getPartition(upper)
println(range)

val rangeFilter = (i: Int, iter: Iterator[(Int, Int)]) => {
  if (range.contains(i))
    for ((k, v) <- iter if k >= lower && k <= upper) yield (k, v)
  else
    Iterator.empty
}

for ((k, v) <- sorted.mapPartitionsWithIndex(rangeFilter, preservesPartitioning = true).collect())
  println(s"$k, $v")

If having the whole partition in memory is acceptable, you could even do something like this:

val glommedAndCached = sorted.glom().cache()
glommedAndCached.map(a => a.slice(a.search(lower), a.search(upper) + 1)).collect()

search is not a member of Array, by the way; I just made an implicit class that has a binary search function, not shown here.
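One possible sketch of such an implicit class, purely for illustration; the name SearchableArray and the exact search semantics are assumptions, not the code actually used above:

// Hypothetical sketch only: adds a `search` method to Array[(Int, Int)].
// Assumes the array is sorted by key and that the searched key is present
// (true in the example above, where every key from 1 to 100 exists).
implicit class SearchableArray(arr: Array[(Int, Int)]) {
  def search(target: Int): Int = {
    var lo = 0
    var hi = arr.length
    while (lo < hi) {
      val mid = (lo + hi) / 2
      if (arr(mid)._1 < target) lo = mid + 1 else hi = mid
    }
    lo // index of the first element whose key is >= target
  }
}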

answered Oct 13 '22 by aaronman