Say I have a PairRDD like this (obviously much more data in real life; assume millions of records):
val scores = sc.parallelize(Array(
  ("a", 1),
  ("a", 2),
  ("a", 3),
  ("b", 3),
  ("b", 1),
  ("a", 4),
  ("b", 4),
  ("b", 2)
))
What is the most efficient way to generate an RDD with the top 2 scores per key?
val top2ByKey = ...
// top2ByKey.collect() should yield:
res3: Array[(String, Int)] = Array((a,4), (a,3), (b,4), (b,3))
Note that the usual "first N" actions don't solve this on their own: in Spark/PySpark, show(n) only displays the first n rows of a DataFrame, first() returns the first element of the dataset (equivalent to take(1)), take(n) returns the first n elements of an RDD (scanning one partition first and using that result to estimate how many more partitions are needed to satisfy the limit), and head()/head(n) return the first row or first n rows of a DataFrame. All of these operate on the dataset as a whole, not per key.
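For example (a quick illustration; the exact elements returned depend on the data's ordering across partitions):

scores.take(2)  // e.g. Array((a,1), (a,2)) – the first two elements overall, not per key
scores.first()  // e.g. (a,1) – same as take(1).head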
I think this should be quite efficient:
Edited according to OP comments:
scores
  .mapValues(v => (v, v)) // seed each score as a (best, secondBest) pair
  .reduceByKey { (u, v) =>
    // keep the two largest distinct scores seen so far for this key
    val values = List(u._1, u._2, v._1, v._2).sorted(Ordering[Int].reverse).distinct
    if (values.size > 1) (values(0), values(1))
    else (values(0), values(0)) // only one distinct score yet: duplicate it
  }
  .collect()
  .foreach(println)
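If you need the top N per key for a general N, the same idea generalizes with aggregateByKey, which also combines map-side before the shuffle, so only each partition's per-key top-N candidates get shuffled. A minimal sketch (the topNByKey helper name is ours, not a Spark API):

import org.apache.spark.rdd.RDD

// Keep a small sorted buffer of at most n scores per key.
def topNByKey(rdd: RDD[(String, Int)], n: Int): RDD[(String, List[Int])] =
  rdd.aggregateByKey(List.empty[Int])(
    (acc, v) => (v :: acc).sorted(Ordering[Int].reverse).take(n), // fold in one score
    (a, b) => (a ++ b).sorted(Ordering[Int].reverse).take(n)      // merge two partition buffers
  )

// topNByKey(scores, 2).collect()
// e.g. Array((a,List(4, 3)), (b,List(4, 3)))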
Since Spark 1.4, there is a built-in way to do this using MLlib: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala
import org.apache.spark.mllib.rdd.MLPairRDDFunctions.fromPairRDD
scores.topByKey(2)
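Note that topByKey(n) returns an RDD[(String, Array[Int])] with each key's scores sorted in descending order; if you want flat (key, score) pairs as in the question, you can flatten the result, e.g.:

scores.topByKey(2).flatMapValues(xs => xs).collect()
// Array((a,4), (a,3), (b,4), (b,3))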