Spark sum up values regardless of keys

My list of tuples looks like this:

Tup = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]

I want to sum up all the values; in this case, 2+1+2+2 = 7.

I can use reduceByKey() in Spark when the keys are the same, but which Spark function can I use to sum up all the values regardless of the key?

I've tried Tup.sum(), but it gives me (u'X45', 2, u'W80', 1, u'F03', 2, u'X61', 2).

BTW, because the dataset is large, I want to do the sum inside the RDD rather than calling Tup.collect() and summing it outside of Spark.

asked Dec 08 '15 by catq

1 Answer

This is pretty easy.

Conceptually, you first map over your original RDD to extract the second element of each tuple, and then sum those values.

In Scala

val x = List(("X45", 2), ("W80", 1), ("F03", 2), ("X61", 2))
val rdd = sc.parallelize(x)
rdd.map(_._2).sum()  // keep only the second element of each tuple, then sum

In Python

x = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]
rdd = sc.parallelize(x)
y = rdd.map(lambda t: t[1]).sum()  # keep only the second element of each tuple, then sum

In both cases the result is 7 (7.0 in Scala, since sum() on a numeric RDD returns a Double).
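As a side note, the same result can be had with two equivalent one-liners; a minimal sketch in Python, assuming the same rdd as above:

from operator import add

y1 = rdd.values().sum()                   # values() drops the keys of a pair RDD
y2 = rdd.map(lambda t: t[1]).reduce(add)  # reduce with plain addition keeps the result an int

Both return 7.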

answered Nov 15 '22 by Knows Not Much