
Spark: finding the max value and the associated key

My question is based upon this question. I have a Spark pair RDD of (key, count): [(a,1), (b,2), (c,1), (d,3)].

How can I find both the key with the highest count and the count itself?

user2543622 asked Feb 26 '16 02:02



2 Answers

(sc
    .parallelize([("a", 1), ("b", 5), ("c", 1), ("d", 3)])
    .max(key=lambda x: x[1]))

returns ('b', 5), not just 5. The first parameter of max is the key function used for comparison (passed explicitly here), but max still returns the whole element, which in this case is the complete tuple.
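The same behavior can be checked without a cluster. The sketch below is plain Python with no Spark dependency: it uses the built-in max, and then a pairwise functools.reduce as a rough stand-in for how a distributed max has to combine partial results. The reduce form is an assumption about the general shape of such a computation, not a quote of Spark's source, and by_count is a hypothetical helper name.

```python
from functools import reduce

# Same (key, count) pairs as in the answer above.
pairs = [("a", 1), ("b", 5), ("c", 1), ("d", 3)]


def by_count(pair):
    """Comparison key: the count stored at index 1 of each pair."""
    return pair[1]


# The key function only decides how elements are compared;
# the element that wins is returned whole.
best = max(pairs, key=by_count)

# Pairwise version: keep the larger of two pairs at each step.
best_reduced = reduce(lambda x, y: max(x, y, key=by_count), pairs)

# Unpack to get both the key and its count.
key, count = best
print(key, count)  # b 5
```

Unpacking the returned tuple gives both pieces the question asks for: the key and its count.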

Quentin Pradet answered Sep 19 '22 13:09



val myRDD = sc.parallelize(Array(
  ("a", 1),
  ("b", 5),
  ("c", 1),
  ("d", 3))).sortBy(_._2, false).take(1)

This sorts the pairs by value in descending order and takes the topmost element, so the result is a one-element array holding ("b", 5).
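The sort-then-take idea can also be tried on a plain Python list, with no Spark involved. This is only a local sketch under that assumption; top_by_sort and top_by_max are hypothetical names used for illustration.

```python
# Same (key, count) pairs as in the answer above.
pairs = [("a", 1), ("b", 5), ("c", 1), ("d", 3)]


def top_by_sort(items):
    """Sort by count, descending, and keep the first element (as a list)."""
    return sorted(items, key=lambda kv: kv[1], reverse=True)[:1]


def top_by_max(items):
    """Single pass: return the pair with the largest count."""
    return max(items, key=lambda kv: kv[1])


print(top_by_sort(pairs))  # [('b', 5)]
print(top_by_max(pairs))   # ('b', 5)
```

Note the difference in shape: sort-then-take yields a one-element collection, while max yields the pair itself.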

user1501308 answered Sep 17 '22 13:09
