Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark: Get top N by key

Say I have a PairRDD as such (Obviously much more data in real life, assume millions of records):

val scores = sc.parallelize(Array(
      ("a", 1),  
      ("a", 2), 
      ("a", 3), 
      ("b", 3), 
      ("b", 1), 
      ("a", 4),  
      ("b", 4), 
      ("b", 2)
))

What is the most efficient way to generate a RDD with the top 2 scores per key?

val top2ByKey = ...
res3: Array[(String, Int)] = Array((a,4), (a,3), (b,4), (b,3))
like image 653
michael_erasmus Avatar asked May 11 '15 10:05

michael_erasmus


People also ask

How do I select the first 10 rows in Spark SQL?

In Spark/PySpark, you can use show() action to get the top/first N (5,10,100 ..)

What does First () do in Spark?

In Spark, the First function always returns the first element of the dataset. It is similar to take(1).

What is Take () in PySpark?

take (num: int) → List[T][source] Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit.

How do I get the first row of a DataFrame Spark?

The head() operator returns the first row of the Spark Dataframe. If you need first n records, then you can use head(n). Let's look at the various versions. head() – returns first row; head(n) – return first n rows.


2 Answers

I think this should be quite efficient:

Edited according to OP comments:

scores.mapValues(p => (p, p)).reduceByKey((u, v) => {
  val values = List(u._1, u._2, v._1, v._2).sorted(Ordering[Int].reverse).distinct
  if (values.size > 1) (values(0), values(1))
  else (values(0), values(0))
}).collect().foreach(println)
like image 161
abalcerek Avatar answered Sep 25 '22 14:09

abalcerek


Since version 1.4, there is a built-in way to do this using MLLib: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala

import org.apache.spark.mllib.rdd.MLPairRDDFunctions.fromPairRDD
scores.topByKey(2)
like image 25
jbochi Avatar answered Sep 22 '22 14:09

jbochi