I'm taking my first steps with Spark (Python) and I'm struggling with the iterator produced by groupByKey(). I'm not able to sum the values. My code looks like this:
example = sc.parallelize([('x',1), ('x',1), ('y', 1), ('z', 1)])
example.groupByKey()
x [1,1]
y [1]
z [1]
How can I sum the values inside the iterator? I tried something like the code below, but it does not work:
example.groupByKey().map(lambda (x,iterator) : (x,sum(iterator)))
example.groupByKey().map(lambda (x,iterator) : (x,list(sum(iterator))))
You can simply mapValues with sum:
example.groupByKey().mapValues(sum)
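If you want to sanity-check what groupByKey().mapValues(sum) computes without spinning up a Spark context, the same per-key reduction can be sketched in plain Python (the helper name group_sum is just for illustration, not a Spark API):

```python
from collections import defaultdict

def group_sum(pairs):
    # Emulates groupByKey().mapValues(sum): first collect every value
    # per key into a list, then sum each list.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

print(group_sum([('x', 1), ('x', 1), ('y', 1), ('z', 1)]))
# {'x': 2, 'y': 1, 'z': 1}
```

On the real RDD, example.groupByKey().mapValues(sum).collect() returns the equivalent list of (key, total) pairs.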
although in this particular case reduceByKey is much more efficient:
example.reduceByKey(lambda x, y: x + y)
or
from operator import add
example.reduceByKey(add)
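For intuition on why this is cheaper: reduceByKey folds each key's values pairwise with the supplied function, combining locally on each partition before any shuffle, instead of materializing the full list of values the way groupByKey does. A minimal local sketch of that pairwise fold (reduce_by_key is a hypothetical helper, not part of the Spark API):

```python
from operator import add

def reduce_by_key(func, pairs):
    # Merge each key's values one at a time, never holding the whole
    # group in memory, mirroring what reduceByKey does per partition.
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

print(reduce_by_key(add, [('x', 1), ('x', 1), ('y', 1), ('z', 1)]))
# {'x': 2, 'y': 1, 'z': 1}
```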
You can also do it this way (note that Python 3 no longer supports tuple unpacking in lambda parameters, so index into the key-value pair instead):
wordCountsGrouped = example.groupByKey().map(lambda kv: (kv[0], sum(kv[1])))
It is a bit late, but I just found this solution.