 

Performing sum on an RDD Int array

Tags:

apache-spark

Is there any built-in transformation to sum the Ints of the following RDD?

org.apache.spark.rdd.RDD[(String, (Int, Int))]

The String is the key and the pair of Ints is the value. What I need is the sum of those Ints, as RDD[(String, Int)]. I tried groupByKey with no success.

Also, the result must again be an RDD.

Thanks in advance

Adam Right asked Mar 17 '23 12:03

1 Answer

If the objective is to sum the elements of the (Int, Int) value, then a map transformation can achieve it:

val arr = Array(("A", (1, 1)), ("B", (2, 2)), ("C", (3, 3)))

val rdd = sc.parallelize(arr)

val result = rdd.map{ case (a, (b, c)) => (a, b + c) }

// result.collect = Array((A,2), (B,4), (C,6))
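The same per-key logic can be checked without a cluster: map over an RDD of pairs behaves just like map over a plain Scala collection, so this is a sketch of the pattern-matching step with no SparkContext assumed.

```scala
// Plain Scala stand-in for the RDD example above: the same
// pattern match destructures each (key, (Int, Int)) pair and
// replaces the tuple value with the sum of its two Ints.
val arr = Array(("A", (1, 1)), ("B", (2, 2)), ("C", (3, 3)))
val result = arr.map { case (a, (b, c)) => (a, b + c) }
// result: Array((A,2), (B,4), (C,6))
```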

If instead the value type is an Array, Array.sum can be used:

val rdd = sc.parallelize(Array(("A", Array(1, 1)), 
                               ("B", Array(2, 2)), 
                               ("C", Array(3, 3))))

rdd.map { case (a, b) => (a, b.sum) }
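A nice property of the Array-valued variant is that b.sum works for any array length, not just pairs. Again a plain-Scala sketch (no SparkContext needed), with arrays of different lengths to illustrate:

```scala
// b.sum adds all Ints of each value array, whatever its length,
// so the same map works when the arity is not fixed at two.
val pairs = Array(("A", Array(1, 1)), ("B", Array(2, 2, 2)), ("C", Array(3)))
val summed = pairs.map { case (a, b) => (a, b.sum) }
// summed: Array((A,2), (B,6), (C,3))
```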

Edit:

The map transformation does not preserve the original partitioner; as @Justin suggested, mapValues may be more appropriate here:

rdd.mapValues{ case (x, y) => x + y }
rdd.mapValues(_.sum) 
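The reason mapValues can preserve the partitioner is that it never touches the keys, only the values. That contract can be sketched with a plain Scala Map standing in for the RDD (note: on Scala 2.13, Map.mapValues is deprecated in favour of .view.mapValues, so .toMap is used to materialise the result):

```scala
// mapValues transforms only the value side; the key set is
// guaranteed unchanged, which is what lets Spark keep the
// existing partitioning for the RDD version.
val m = Map("A" -> (1, 1), "B" -> (2, 2))
val summed = m.mapValues { case (x, y) => x + y }.toMap
// summed: Map(A -> 2, B -> 4)
```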
Shyamendra Solanki answered Apr 02 '23 16:04