Spark sum up values regardless of keys

My list of tuples looks like this:

Tup = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]

I want to sum up all the values; in this case, 2+1+2+2 = 7.

I can use reduceByKey() in Spark when the keys are the same, but which Spark function can I use to sum up all the values regardless of the key?

I've tried Tup.sum(), but it gives me (u'X45', 2, u'W80', 1, u'F03', 2, u'X61', 2).

BTW, because the dataset is large, I want to do the sum inside the RDD rather than calling Tup.collect() and summing it outside of Spark.

asked Dec 08 '15 by catq

1 Answer

This is pretty easy.

Conceptually, you first map over your original RDD to extract the second element of each tuple, and then sum those values.

In Scala

val x = List(("X45", 2), ("W80", 1), ("F03", 2), ("X61", 2))
val rdd = sc.parallelize(x)
rdd.map(_._2).sum()  // keep only the second element of each tuple, then sum

In Python

x = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]
rdd = sc.parallelize(x)
y = rdd.map(lambda t: t[1]).sum()  # keep only the second element of each tuple, then sum

In both cases the result is 7 (7.0 in Scala, since sum() on a numeric RDD returns a Double).
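As a side note, the same result can be had with two equivalent one-liners; a minimal sketch in Python, assuming the same rdd as above:

from operator import add

y1 = rdd.values().sum()                   # values() drops the keys of a pair RDD
y2 = rdd.map(lambda t: t[1]).reduce(add)  # reduce with plain addition keeps the result an int

Both return 7.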

answered Nov 15 '22 by Knows Not Much