 

How to sum values in an iterator in a PySpark groupByKey()

I'm taking my first steps with Spark (Python) and I'm struggling with the iterator produced by groupByKey(). I'm not able to sum the values. My code looks like this:

example = sc.parallelize([('x',1), ('x',1), ('y', 1), ('z', 1)])

example.groupByKey()
x [1,1]
y [1]
z [1]

How do I get the sum over each iterator? I tried something like the following, but it does not work:

example.groupByKey().map(lambda (x, iterator): (x, sum(iterator)))
example.groupByKey().map(lambda (x, iterator): (x, list(sum(iterator))))
asked Jul 12 '15 by Leonida Gianfagna

2 Answers

You can simply use mapValues with sum:

example.groupByKey().mapValues(sum)

although in this particular case reduceByKey is much more efficient, because it combines values on each partition before shuffling them across the network:

example.reduceByKey(lambda x, y: x + y)

or

from operator import add

example.reduceByKey(add)
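The difference between the two approaches can be sketched in plain Python, without a Spark cluster (the pairs list below is a hypothetical stand-in for the RDD's contents): groupByKey().mapValues(sum) materializes the full list of values per key before summing, while reduceByKey(add) folds values into a running total.

```python
from operator import add
from collections import defaultdict

pairs = [('x', 1), ('x', 1), ('y', 1), ('z', 1)]

# groupByKey().mapValues(sum): collect every value per key, then sum the list.
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
grouped_sums = {k: sum(vs) for k, vs in groups.items()}

# reduceByKey(add): fold each value into a running total, never storing the list.
reduced = {}
for k, v in pairs:
    reduced[k] = add(reduced[k], v) if k in reduced else v

print(grouped_sums == reduced)  # True: same result for any associative, commutative op
```

Both yield {'x': 2, 'y': 1, 'z': 1}; in Spark the second form avoids buffering all values per key.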
answered Nov 14 '22 by zero323


You can also do it this way:

wordCountsGrouped = wordsGrouped.groupByKey().map(lambda kv: (kv[0], sum(kv[1])))

It is a bit late, but I just found this solution.

answered Nov 15 '22 by mgarciaibanez