Is there any built-in transformation to sum the Ints of the following RDD?
org.apache.spark.rdd.RDD[(String, (Int, Int))]
The String is the key and an Int array is the value; what I need is the sum of all the Ints as an RDD[(String, Int)]. I tried groupByKey with no success.
Also, the result must again be an RDD.
Thanks in advance.
If the objective is to sum the two elements of the (Int, Int) value, a map transformation can achieve it:
val arr = Array(("A", (1, 1)), ("B", (2, 2)), ("C", (3, 3)))
val rdd = sc.parallelize(arr)
val result = rdd.map{ case (a, (b, c)) => (a, b + c) }
// result.collect = Array((A,2), (B,4), (C,6))
Instead, if the value type is an Array, Array.sum can be used:
val rdd = sc.parallelize(Array(("A", Array(1, 1)),
                               ("B", Array(2, 2)),
                               ("C", Array(3, 3))))
rdd.map { case (a, b) => (a, b.sum) }
Edit:
The map transformation does not preserve the original partitioner; as @Justin suggested, mapValues may be more appropriate here:
rdd.mapValues{ case (x, y) => x + y }
rdd.mapValues(_.sum)
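The difference matters if the RDD has already been partitioned by key. A minimal sketch to illustrate (assuming a spark-shell session where sc is available; the two-partition HashPartitioner and the variable names are only for demonstration):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Array(("A", (1, 1)), ("B", (2, 2)), ("C", (3, 3))))
  .partitionBy(new HashPartitioner(2))

// map could change the keys, so Spark drops the partitioner
val viaMap = pairs.map { case (k, (x, y)) => (k, x + y) }
println(viaMap.partitioner)        // None

// mapValues leaves the keys untouched, so the partitioner is preserved
// and a later reduceByKey/join on the same keys can avoid a shuffle
val viaMapValues = pairs.mapValues { case (x, y) => x + y }
println(viaMapValues.partitioner)  // Some(org.apache.spark.HashPartitioner@...)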