Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get the max value for each key in a Spark RDD

What is the best way to return the max row (value) associated with each unique key in a spark RDD?

I'm using python and I've tried Math max, mapping and reducing by keys and aggregates. Is there an efficient way to do this? Possibly an UDF?

I have in RDD format:

[(v, 3),
 (v, 1),
 (v, 1),
 (w, 7),
 (w, 1),
 (x, 3),
 (y, 1),
 (y, 1),
 (y, 2),
 (y, 3)]

And I need to return:

[(v, 3),
 (w, 7),
 (x, 3),
 (y, 3)]

Ties can return the first value or random.

like image 840
captainKirk104 Avatar asked May 04 '16 00:05

captainKirk104


People also ask

How do you find the maximum RDD?

Basically the max function orders by the return value of the lambda function. Here a is a pair RDD with elements such as ('key',int) and x[1] just refers to the integer part of the element. Note that the max function by itself will order by key and return the max value.

How does Spark calculate max value?

We can get the maximum value from the column in the dataframe using the select() method. Using the max() method, we can get the maximum value from the column.

Which RDD function returns min/max count mean?

colStats() returns an instance of MultivariateStatisticalSummary , which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.

How many RDDs can Cogroup () can work at once?

cogroup() can be used for much more than just implementing joins. We can also use it to implement intersect by key. Additionally, cogroup() can work on three or more RDDs at once.


1 Answers

Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:

(Scala)

val grouped = rdd.reduceByKey(math.max(_, _))

(Python)

grouped = rdd.reduceByKey(max)

(Java 7)

JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer v1, Integer v2) {
            return Math.max(v1, v2);
    }
});

(Java 8)

JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
    (v1, v2) -> Math.max(v1, v2)
);

API doc for reduceByKey:

  • Scala
  • Python
  • Java
like image 186
Daniel de Paula Avatar answered Sep 20 '22 16:09

Daniel de Paula