Spark - how to get top N of rdd as a new rdd (without collecting at the driver)

Tags:

scala

apache-spark

rdd

I am wondering how to filter an RDD down to the elements whose value is among the top N values. Usually I would take the top N distinct values as an array in the driver, find the Nth value, and broadcast it as a threshold to filter the RDD, like so:

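// top(N) returns the N largest distinct values to the driver, in descending order
val topNvalues = sc.broadcast(rdd.map(_.fieldToThreshold).distinct.top(N))
// the last element is therefore the Nth largest value
val threshold = topNvalues.value.last
val rddWithTopNValues = rdd.filter(_.fieldToThreshold >= threshold)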

In this case, however, N is too large for the driver. How can I do this purely with RDDs, along these lines?

def getExpensiveItems(itemPrices: RDD[(Int, Float)], count: Int): RDD[(Int, Float)] = {
     val sortedPrices = itemPrices.sortBy(-_._2).map(_._1).distinct

     // How to do this without collecting results to driver??
     val highPrices = itemPrices.getTopNValuesWithoutCollect(count)

     itemPrices.join(highPrices.keyBy(x => x)).map(_._2._1)
}


asked Nov 29 '17 18:11

anthonybell

1 Answer

Use zipWithIndex on the sorted RDD, then filter on the index to keep the first n items. To illustrate, consider this RDD sorted in descending order:

val rdd = sc.parallelize((1 to 10).map( _ => math.random)).sortBy(-_)

Then

rdd.zipWithIndex.filter(_._2 < 4)

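yields the four largest items, each paired with its index, as a new RDD, without collecting the RDD to the driver. Append .map(_._1) to drop the index. Note that zipWithIndex itself triggers a Spark job when the RDD has more than one partition.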
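Applied to the getExpensiveItems signature from the question, the same idea might look like the sketch below. It assumes that "top N" means the N largest distinct prices, and that every item carrying one of those prices should be kept, ties included. The object name TopNExample, the sample data, and the local SparkSession setup are illustrative assumptions, not part of the question or the original answer.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object TopNExample {
  // Keeps every (id, price) pair whose price is among the `count`
  // largest distinct prices. Nothing is collected to the driver here.
  def getExpensiveItems(itemPrices: RDD[(Int, Float)], count: Int): RDD[(Int, Float)] = {
    // Distinct prices, sorted descending, indexed, cut off at `count`.
    val highPrices: RDD[(Float, Unit)] = itemPrices
      .map(_._2)
      .distinct()
      .sortBy(p => p, ascending = false)
      .zipWithIndex()
      .filter { case (_, idx) => idx < count }
      .map { case (price, _) => (price, ()) }

    // Join on price so only items with a top price survive.
    itemPrices
      .map { case (id, price) => (price, id) }
      .join(highPrices)
      .map { case (price, (id, _)) => (id, price) }
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("top-n-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val prices = sc.parallelize(Seq(
      (1, 9.5f), (2, 3.0f), (3, 9.5f), (4, 7.25f), (5, 1.0f)))

    // collect() is used here only to print the small demo result.
    getExpensiveItems(prices, 2).collect().sortBy(_._1).foreach(println)

    spark.stop()
  }
}
```

The join shuffles both sides, so for a very large itemPrices it may be cheaper to pull only the single Nth price to the driver (for example by filtering the indexed RDD on idx == count - 1) and then filter on that threshold. Which variant is faster depends on the data and has not been measured here.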


answered Jan 04 '23 15:01

elm
