My RDD of tuples (Tup) looks like this:
Tup = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]
I want to sum up all the values; in this case, 2 + 1 + 2 + 2 = 7.
I can use Tup.reduceByKey() in Spark when the keys are the same, but which function can I use in Spark to sum all the values regardless of the key?
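For example (a minimal sketch, assuming sc is an active SparkContext), reduceByKey only merges values that share a key, and all the keys here are distinct, so it returns the data unchanged:
from operator import add
Tup.reduceByKey(add).collect()
# [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]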
I've tried Tup.sum(), but it gives me (u'X45', 2, u'W80', 1, u'F03', 2, u'X61', 2).
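I assume that is because + on Python tuples concatenates them rather than adding element-wise, which plain Python (no Spark needed) shows:
(u'X45', 2) + (u'W80', 1)
# (u'X45', 2, u'W80', 1) -- concatenation, not element-wise addition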
BTW, because the dataset is large, I want to compute the sum within the RDD rather than use Tup.collect() and sum it up outside of Spark.
This is pretty easy.
Conceptually, you should first map over the original RDD to extract the second element of each tuple, and then sum those values.
In Scala
val x = List(("X45", 2), ("W80", 1), ("F03", 2), ("X61", 2))
val rdd = sc.parallelize(x)
rdd.map(_._2).sum()  // take the second element of each tuple, then sum
In Python
x = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]
rdd = sc.parallelize(x)
y = rdd.map(lambda t: t[1]).sum()  # take the second element of each tuple, then sum
In both cases, the result is the sum, 7.
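As a side note, since rdd is a pair RDD you can also select the values with values(), which should be equivalent to the map above:
y = rdd.values().sum()  # values() keeps the second element of each key-value pair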