What is the best way to return the max row (value) associated with each unique key in a spark RDD?
I'm using python and I've tried Math max, mapping and reducing by keys and aggregates. Is there an efficient way to do this? Possibly an UDF?
I have in RDD format:
[(v, 3),
(v, 1),
(v, 1),
(w, 7),
(w, 1),
(x, 3),
(y, 1),
(y, 1),
(y, 2),
(y, 3)]
And I need to return:
[(v, 3),
(w, 7),
(x, 3),
(y, 3)]
Ties can return the first value or random.
Basically the max function orders by the return value of the lambda function. Here a is a pair RDD with elements such as ('key',int) and x[1] just refers to the integer part of the element. Note that the max function by itself will order by key and return the max value.
We can get the maximum value from the column in the dataframe using the select() method. Using the max() method, we can get the maximum value from the column.
colStats() returns an instance of MultivariateStatisticalSummary , which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
cogroup() can be used for much more than just implementing joins. We can also use it to implement intersect by key. Additionally, cogroup() can work on three or more RDDs at once.
Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:
(Scala)
val grouped = rdd.reduceByKey(math.max(_, _))
(Python)
grouped = rdd.reduceByKey(max)
(Java 7)
JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
new Function2<Integer, Integer, Integer>() {
public Integer call(Integer v1, Integer v2) {
return Math.max(v1, v2);
}
});
(Java 8)
JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
(v1, v2) -> Math.max(v1, v2)
);
API doc for reduceByKey:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With