I am new to Spark programming. I need help with a Spark Python program: given input data like the sample below, I want to compute a cumulative sum within each group. I would appreciate any guidance.
11,1,1,100
11,1,2,150
12,1,1,50
12,2,1,70
12,2,2,20
The expected output (cumulative sum of the last column within each group formed by the first two columns):
11,1,1,100
11,1,2,250 // (100+150)
12,1,1,50
12,2,1,70
12,2,2,90 // (70+20)
The code I tried:
def parseline(line):
    # Split a CSV line into four numeric fields
    fields = line.split(",")
    f1 = float(fields[0])
    f2 = float(fields[1])
    f3 = float(fields[2])
    f4 = float(fields[3])
    return (f1, f2, f3, f4)

input = sc.textFile("file:///...../a.dat")
line = input.map(parseline)
linesorted = line.sortBy(lambda x: (x[0], x[1], x[2]))
runningpremium = linesorted.map(lambda y: ((y[0], y[1]), y[3])).reduceByKey(lambda accum, num: accum + num)
for i in runningpremium.collect():
    print(i)
As mentioned in the comments, you can use a window function to compute the cumulative sum on a Spark DataFrame. First, create an example DataFrame with dummy columns 'a', 'b', 'c', 'd':
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ls = [(11,1,1,100), (11,1,2,150), (12,1,1,50), (12,2,1,70), (12,2,2,20)]
ls_rdd = spark.sparkContext.parallelize(ls)
df = spark.createDataFrame(ls_rdd, schema=['a', 'b', 'c', 'd'])
Partition the window by columns a and b, order it by column c, then apply the sum function over column d:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
w = Window.partitionBy([df['a'], df['b']]).orderBy(df['c'].asc())
df_cumsum = df.select('a', 'b', 'c', func.sum(df.d).over(w).alias('cum_sum'))
df_cumsum.sort(['a', 'b', 'c']).show() # simple sort column
Output
+---+---+---+-------+
| a| b| c|cum_sum|
+---+---+---+-------+
| 11| 1| 1| 100|
| 11| 1| 2| 250|
| 12| 1| 1| 50|
| 12| 2| 1| 70|
| 12| 2| 2| 90|
+---+---+---+-------+
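For intuition, the window above performs a running sum over each (a, b) group, with rows ordered by c. A minimal pure-Python sketch of that same per-group scan (no Spark required; `cumulative_by_group` is just an illustrative name), applied to the sample rows:

```python
from itertools import groupby

rows = [(11, 1, 1, 100), (11, 1, 2, 150), (12, 1, 1, 50),
        (12, 2, 1, 70), (12, 2, 2, 20)]

def cumulative_by_group(rows):
    """Running sum of the 4th field within each (a, b) group, ordered by c."""
    out = []
    # Sorting makes rows of the same (a, b) group adjacent and ordered by c,
    # so a single pass with groupby can accumulate the running total.
    for _, group in groupby(sorted(rows), key=lambda r: (r[0], r[1])):
        total = 0
        for a, b, c, d in group:
            total += d
            out.append((a, b, c, total))
    return out

print(cumulative_by_group(rows))
# [(11, 1, 1, 100), (11, 1, 2, 250), (12, 1, 1, 50), (12, 2, 1, 70), (12, 2, 2, 90)]
```

This mirrors what Spark does per partition of the window, except Spark distributes the groups across the cluster instead of sorting everything on one machine.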