I want to compute a cumulative sum in Spark. Here is the registered table (input):
+---------------+-------------------+----+----+----+
|     product_id|          date_time| ack|val1|val2|
+---------------+-------------------+----+----+----+
|4008607333T.upf|2017-12-13:02:27:01|3-46|  53|  52|
|4008607333T.upf|2017-12-13:02:27:03|3-47|  53|  52|
|4008607333T.upf|2017-12-13:02:27:08|3-46|  53|  52|
|4008607333T.upf|2017-12-13:02:28:01|3-47|  53|  52|
|4008607333T.upf|2017-12-13:02:28:07|3-46|  15|   1|
+---------------+-------------------+----+----+----+
Hive query:
SELECT *,
       SUM(val1) OVER (PARTITION BY product_id, ack
                       ORDER BY date_time
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) val1_sum,
       SUM(val2) OVER (PARTITION BY product_id, ack
                       ORDER BY date_time
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) val2_sum
FROM test
Output:
+---------------+-------------------+----+----+----+--------+--------+
|     product_id|          date_time| ack|val1|val2|val1_sum|val2_sum|
+---------------+-------------------+----+----+----+--------+--------+
|4008607333T.upf|2017-12-13:02:27:01|3-46|  53|  52|      53|      52|
|4008607333T.upf|2017-12-13:02:27:08|3-46|  53|  52|     106|     104|
|4008607333T.upf|2017-12-13:02:28:07|3-46|  15|   1|     121|     105|
|4008607333T.upf|2017-12-13:02:27:03|3-47|  53|  52|      53|      52|
|4008607333T.upf|2017-12-13:02:28:01|3-47|  53|  52|     106|     104|
+---------------+-------------------+----+----+----+--------+--------+
Using the following Spark logic, I get the same output as above:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy('product_id, 'ack).orderBy('date_time)
import org.apache.spark.sql.functions._
val newDf = inputDF.withColumn("val_sum", sum('val1) over w).withColumn("val2_sum", sum('val2) over w)
newDf.show
However, when I run this logic on a Spark cluster, the val_sum value is sometimes half of the expected cumulative sum and sometimes a different value entirely. I don't know why this happens on the cluster. Is it due to partitioning? How can I compute the cumulative sum of a column correctly on a Spark cluster?
To get the cumulative sum with the DataFrame API, specify the window frame explicitly with rowsBetween. When a window has an orderBy but no frame specification, Spark defaults to a RANGE frame ending at the current row; a RANGE frame treats rows with equal date_time values as peers and gives them all the same running total, which is likely why your cluster results differ from the Hive query. In Spark 2.1 and newer, create the window as follows:
val w = Window.partitionBy($"product_id", $"ack")
.orderBy($"date_time")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
This tells Spark to aggregate from the beginning of the partition up to and including the current row. On older versions of Spark, use rowsBetween(Long.MinValue, 0) for the same effect.
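For example, on Spark 2.0 and earlier the same window would be defined like this instead (a minimal sketch; only the frame-boundary constants differ):

// Spark 2.0 and earlier: the Window.unboundedPreceding / Window.currentRow
// constants do not exist yet, so pass the raw boundary values instead
val w = Window.partitionBy($"product_id", $"ack")
  .orderBy($"date_time")
  .rowsBetween(Long.MinValue, 0) // unbounded preceding .. current row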
Then apply the window the same way as before:
val newDf = inputDF.withColumn("val_sum", sum($"val1").over(w))
.withColumn("val2_sum", sum($"val2").over(w))