Say I have a PairRDD like this (obviously much more data in real life; assume millions of records):
val scores = sc.parallelize(Array(
  ("a", 1),
  ("a", 2),
  ("a", 3),
  ("b", 3),
  ("b", 1),
  ("a", 4),
  ("b", 4),
  ("b", 2)
))
What is the most efficient way to generate an RDD with the top 2 scores per key?
val top2ByKey = ...
// top2ByKey.collect() should yield:
res3: Array[(String, Int)] = Array((a,4), (a,3), (b,4), (b,3))
Note that the usual "first N" actions don't solve this on their own: in Spark/PySpark, show(n) only displays the first n rows of a DataFrame, first() returns the first element of the dataset (equivalent to take(1)), take(n) returns the first n elements of an RDD (scanning one partition first and using that result to estimate how many more partitions are needed to satisfy the limit), and head()/head(n) return the first row or first n rows of a DataFrame. All of these operate on the dataset as a whole, not per key.
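For example (a quick illustration; the exact elements returned depend on the data's ordering across partitions):

scores.take(2)  // e.g. Array((a,1), (a,2)) – the first two elements overall, not per key
scores.first()  // e.g. (a,1) – same as take(1).head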
I think this should be quite efficient:
Edited according to OP comments:
scores
  .mapValues(v => (v, v)) // seed each score as a (best, secondBest) pair
  .reduceByKey { (u, v) =>
    // keep the two largest distinct scores seen so far for this key
    val values = List(u._1, u._2, v._1, v._2).sorted(Ordering[Int].reverse).distinct
    if (values.size > 1) (values(0), values(1))
    else (values(0), values(0)) // only one distinct score yet: duplicate it
  }
  .collect()
  .foreach(println)
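If you need the top N per key for a general N, the same idea generalizes with aggregateByKey, which also combines map-side before the shuffle, so only each partition's per-key top-N candidates get shuffled. A minimal sketch (the topNByKey helper name is ours, not a Spark API):

import org.apache.spark.rdd.RDD

// Keep a small sorted buffer of at most n scores per key.
def topNByKey(rdd: RDD[(String, Int)], n: Int): RDD[(String, List[Int])] =
  rdd.aggregateByKey(List.empty[Int])(
    (acc, v) => (v :: acc).sorted(Ordering[Int].reverse).take(n), // fold in one score
    (a, b) => (a ++ b).sorted(Ordering[Int].reverse).take(n)      // merge two partition buffers
  )

// topNByKey(scores, 2).collect()
// e.g. Array((a,List(4, 3)), (b,List(4, 3)))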
Since Spark 1.4, there is a built-in way to do this using MLlib: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala
import org.apache.spark.mllib.rdd.MLPairRDDFunctions.fromPairRDD
scores.topByKey(2)
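Note that topByKey(n) returns an RDD[(String, Array[Int])] with each key's scores sorted in descending order; if you want flat (key, score) pairs as in the question, you can flatten the result, e.g.:

scores.topByKey(2).flatMapValues(xs => xs).collect()
// Array((a,4), (a,3), (b,4), (b,3))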