Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Spark How to find cumulative sum by group using RDD API

I am new to spark programming. Need help with spark python program, where i have input data like this and want to get cumulative summary for each group. Appreciate if someone guide me on this.

Input Data:

11,1,1,100

11,1,2,150

12,1,1,50

12,2,1,70

12,2,2,20

Output Data Needed like this:

11,1,1,100

11,1,2,250 //(100+150)

12,1,1,50

12,2,1,70

12,2,2,90 // (70+20)

the code i tried:

def parseline(line):
    fields = line.split(",")
    f1 = float(fields[0])
    f2 = float(fields[1])
    f3 = float(fields[2])
    f4 = float(fields[3])
    return (f1, f2, f3, f4)

input = sc.textFile("FIle:///...../a.dat")
line = input.map(parseline)
linesorted = line.sortBy(lambda x: (x[0], x[1], x[2]))
runningpremium = linesorted.map(lambda y: (((y[0], y[1]),     y[3])).reduceByKey(lambda accum, num: accum + num)

for i in runningpremium.collect():
      print i
like image 250
RoyR Avatar asked Feb 22 '26 11:02

RoyR


1 Answers

As in the comment, you can use window function to do the cumulative sum here on Spark Dataframe. First, we can create an example dataframe with dummie columns 'a', 'b', 'c', 'd'

ls = [(11,1,1,100), (11,1,2,150), (12,1,1,50), (12,2,1,70), (12,2,2,20)]
ls_rdd = spark.sparkContext.parallelize(ls)
df = spark.createDataFrame(ls_rdd, schema=['a', 'b', 'c', 'd'])

You can partition by column a and b then order by column c. Then, apply the sum function over the column d at the end

from pyspark.sql.window import Window
import pyspark.sql.functions as func

w = Window.partitionBy([df['a'], df['b']]).orderBy(df['c'].asc())
df_cumsum = df.select('a', 'b', 'c', func.sum(df.d).over(w).alias('cum_sum'))
df_cumsum.sort(['a', 'b', 'c']).show() # simple sort column

Output

+---+---+---+-------+
|  a|  b|  c|cum_sum|
+---+---+---+-------+
| 11|  1|  1|    100|
| 11|  1|  2|    250|
| 12|  1|  1|     50|
| 12|  2|  1|     70|
| 12|  2|  2|     90|
+---+---+---+-------+
like image 148
titipata Avatar answered Feb 25 '26 08:02

titipata



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!