I am new to Spark programming. I need help with a Spark Python program: given input data like the sample below, I want to compute a cumulative sum within each group. I would appreciate any guidance.
11,1,1,100
11,1,2,150
12,1,1,50
12,2,1,70
12,2,2,20
The expected output (cumulative sum of the last column within each group formed by the first two columns):
11,1,1,100
11,1,2,250 // (100+150)
12,1,1,50
12,2,1,70
12,2,2,90 // (70+20)
The code I tried:
def parseline(line):
    # Split a CSV line into four numeric fields
    fields = line.split(",")
    f1 = float(fields[0])
    f2 = float(fields[1])
    f3 = float(fields[2])
    f4 = float(fields[3])
    return (f1, f2, f3, f4)

input = sc.textFile("file:///...../a.dat")
line = input.map(parseline)
linesorted = line.sortBy(lambda x: (x[0], x[1], x[2]))
runningpremium = linesorted.map(lambda y: ((y[0], y[1]), y[3])).reduceByKey(lambda accum, num: accum + num)
for i in runningpremium.collect():
    print(i)
As mentioned in the comments, you can use a window function to compute the cumulative sum on a Spark DataFrame. First, create an example DataFrame with dummy columns 'a', 'b', 'c', 'd':
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ls = [(11,1,1,100), (11,1,2,150), (12,1,1,50), (12,2,1,70), (12,2,2,20)]
ls_rdd = spark.sparkContext.parallelize(ls)
df = spark.createDataFrame(ls_rdd, schema=['a', 'b', 'c', 'd'])
Partition the window by columns a and b, order it by column c, then apply the sum function over column d:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
w = Window.partitionBy([df['a'], df['b']]).orderBy(df['c'].asc())
df_cumsum = df.select('a', 'b', 'c', func.sum(df.d).over(w).alias('cum_sum'))
df_cumsum.sort(['a', 'b', 'c']).show() # simple sort column
Output
+---+---+---+-------+
| a| b| c|cum_sum|
+---+---+---+-------+
| 11| 1| 1| 100|
| 11| 1| 2| 250|
| 12| 1| 1| 50|
| 12| 2| 1| 70|
| 12| 2| 2| 90|
+---+---+---+-------+
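For intuition, the window above performs a running sum over each (a, b) group, with rows ordered by c. A minimal pure-Python sketch of that same per-group scan (no Spark required; `cumulative_by_group` is just an illustrative name), applied to the sample rows:

```python
from itertools import groupby

rows = [(11, 1, 1, 100), (11, 1, 2, 150), (12, 1, 1, 50),
        (12, 2, 1, 70), (12, 2, 2, 20)]

def cumulative_by_group(rows):
    """Running sum of the 4th field within each (a, b) group, ordered by c."""
    out = []
    # Sorting makes rows of the same (a, b) group adjacent and ordered by c,
    # so a single pass with groupby can accumulate the running total.
    for _, group in groupby(sorted(rows), key=lambda r: (r[0], r[1])):
        total = 0
        for a, b, c, d in group:
            total += d
            out.append((a, b, c, total))
    return out

print(cumulative_by_group(rows))
# [(11, 1, 1, 100), (11, 1, 2, 250), (12, 1, 1, 50), (12, 2, 1, 70), (12, 2, 2, 90)]
```

This mirrors what Spark does per partition of the window, except Spark distributes the groups across the cluster instead of sorting everything on one machine.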