pyspark: rolling average using timeseries data

I have a dataset consisting of a timestamp column and a dollars column. For each row, I would like to find the average of dollars over the week ending at that row's timestamp. I was initially looking at the pyspark.sql.functions.window function, but that bins the data into fixed weeks.

Here's an example:

    %pyspark
    import datetime
    from pyspark.sql import functions as F

    df1 = sc.parallelize([(17, "2017-03-11T15:27:18+00:00"),
                          (13, "2017-03-11T12:27:18+00:00"),
                          (21, "2017-03-17T11:27:18+00:00")]).toDF(["dollars", "datestring"])
    df2 = df1.withColumn('timestampGMT', df1.datestring.cast('timestamp'))

    w = df2.groupBy(F.window("timestampGMT", "7 days")).agg(F.avg("dollars").alias('avg'))
    w.select(w.window.start.cast("string").alias("start"),
             w.window.end.cast("string").alias("end"),
             "avg").collect()

This results in two records:

|        start        |          end         | avg |
|---------------------|----------------------|-----|
|'2017-03-16 00:00:00'| '2017-03-23 00:00:00'| 21.0|
|'2017-03-09 00:00:00'| '2017-03-16 00:00:00'| 15.0|

The window function binned the time series data rather than performing a rolling average.
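
For what it's worth, the bin edges above are consistent with tumbling windows being aligned to the Unix epoch (1970-01-01 00:00:00 UTC) when no startTime offset is given. Here is a minimal plain-Python sketch of that binning, assuming a UTC session timezone. `tumbling_window` is a hypothetical helper written for illustration, not a Spark API:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
WEEK = timedelta(days=7)

def tumbling_window(ts, width=WEEK):
    """Return the [start, end) bin containing ts, with bins aligned to the epoch."""
    n = (ts - EPOCH) // width  # whole number of bins elapsed since the epoch
    start = EPOCH + n * width
    return start, start + width

rows = [
    (17, "2017-03-11T15:27:18+00:00"),
    (13, "2017-03-11T12:27:18+00:00"),
    (21, "2017-03-17T11:27:18+00:00"),
]

# Group dollars by the fixed bin each timestamp falls into.
bins = {}
for dollars, datestring in rows:
    key = tumbling_window(datetime.fromisoformat(datestring))
    bins.setdefault(key, []).append(dollars)

averages = {key: sum(vals) / len(vals) for key, vals in bins.items()}
for (start, end), avg in sorted(averages.items()):
    print(start.date(), end.date(), avg)
```

The two March 11 rows share the bin starting 2017-03-09, and the March 17 row lands alone in the bin starting 2017-03-16. The bin edges do not depend on where the rows sit inside the week, which is why this is not a rolling average.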

Is there a way to perform a rolling average where I'll get back a weekly average for each row with a time period ending at the timestampGMT of the row?

EDIT:

Zhang's answer below is close to what I want, but not exactly what I'd like to see.

Here's a better example to show what I'm trying to get at:

    %pyspark
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window  # needed for Window.partitionBy below

    df = spark.createDataFrame(
        [(17, "2017-03-10T15:27:18+00:00"),
         (13, "2017-03-15T12:27:18+00:00"),
         (25, "2017-03-18T11:27:18+00:00")],
        ["dollars", "timestampGMT"])
    df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
    # Partitioning by the 7-day tumbling window still bins rows into fixed weeks
    df = df.withColumn(
        'rolling_average',
        F.avg("dollars").over(Window.partitionBy(F.window("timestampGMT", "7 days"))))

This results in the following dataframe:

    dollars  timestampGMT           rolling_average
    25       2017-03-18 11:27:18.0  25
    17       2017-03-10 15:27:18.0  15
    13       2017-03-15 12:27:18.0  15

I'd like the average to be over the week preceding the timestamp in the timestampGMT column, which would result in this:

    dollars  timestampGMT           rolling_average
    17       2017-03-10 15:27:18.0  17
    13       2017-03-15 12:27:18.0  15
    25       2017-03-18 11:27:18.0  19

In the above results, the rolling_average for 2017-03-10 is 17, since there are no preceding records. The rolling_average for 2017-03-15 is 15 because it averages the 13 from 2017-03-15 and the 17 from 2017-03-10, which falls within the preceding 7-day window. The rolling_average for 2017-03-18 is 19 because it averages the 25 from 2017-03-18 and the 13 from 2017-03-15, which falls within the preceding 7-day window. It does not include the 17 from 2017-03-10 because that row does not fall within the preceding 7-day window.
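
To double-check the arithmetic above, here is a small plain-Python sketch (no Spark involved) that applies the same rule: for each row, average every row whose timestamp is at or before the current one and at most 7 days older. The variable names are mine, chosen for illustration:

```python
from datetime import datetime, timedelta

WEEK = timedelta(days=7)

rows = [
    (17, datetime.fromisoformat("2017-03-10T15:27:18+00:00")),
    (13, datetime.fromisoformat("2017-03-15T12:27:18+00:00")),
    (25, datetime.fromisoformat("2017-03-18T11:27:18+00:00")),
]

expected = []
for _, current in rows:
    # Keep rows at or before the current timestamp and no more than 7 days older.
    in_window = [d for d, ts in rows if timedelta(0) <= current - ts <= WEEK]
    expected.append(sum(in_window) / len(in_window))

# Gap between the last and first rows: more than a week, so the 17 drops out.
gap = rows[2][1] - rows[0][1]

print(expected)  # [17.0, 15.0, 19.0]
print(gap)       # 7 days, 20:00:00
```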

Is there a way to do this rather than the binning window where the weekly windows don't overlap?

asked Aug 21 '17 22:08

Bob Swain

1 Answer

I figured out the correct way to calculate a moving/rolling average using this Stack Overflow question:

Spark Window Functions - rangeBetween dates

The basic idea is to convert your timestamp column to seconds, and then you can use the rangeBetween function in the pyspark.sql.Window class to include the correct rows in your window.
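
To make the frame bounds concrete, here is a plain-Python sketch of what I understand a range frame of (-7 days, 0) over epoch seconds to mean: for a row with key k, the frame holds every row whose key lies in [k - 604800, k], both ends inclusive. This only emulates the semantics with the standard library. The bisect-based lookup is my own illustration and is not how Spark implements it:

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

def days(i):
    """Number of seconds in i days."""
    return i * 86400

rows = [
    (17, "2017-03-10T15:27:18+00:00"),
    (13, "2017-03-15T12:27:18+00:00"),
    (25, "2017-03-18T11:27:18+00:00"),
]

# Sort by epoch seconds, mirroring an ordering on the timestamp cast to long.
keyed = sorted((int(datetime.fromisoformat(ts).timestamp()), dollars)
               for dollars, ts in rows)
keys = [k for k, _ in keyed]
values = [v for _, v in keyed]

rolling = []
for k in keys:
    lo = bisect_left(keys, k - days(7))  # first row with key >= k - 604800
    hi = bisect_right(keys, k)           # one past the last row with key <= k
    frame = values[lo:hi]
    rolling.append(sum(frame) / len(frame))

print(rolling)  # [17.0, 15.0, 19.0]
```

Working in integer seconds is what makes the offsets plain numbers, which is why the timestamp gets cast to long before ordering.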

Here's the solved example:

    %pyspark
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # function to calculate number of seconds from number of days
    days = lambda i: i * 86400

    df = spark.createDataFrame(
        [(17, "2017-03-10T15:27:18+00:00"),
         (13, "2017-03-15T12:27:18+00:00"),
         (25, "2017-03-18T11:27:18+00:00")],
        ["dollars", "timestampGMT"])
    df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))

    # create window by casting timestamp to long (number of seconds)
    w = (Window.orderBy(F.col("timestampGMT").cast('long')).rangeBetween(-days(7), 0))

    df = df.withColumn('rolling_average', F.avg("dollars").over(w))

This results in the exact column of rolling averages that I was looking for:

    dollars  timestampGMT           rolling_average
    17       2017-03-10 15:27:18.0  17.0
    13       2017-03-15 12:27:18.0  15.0
    25       2017-03-18 11:27:18.0  19.0

answered Oct 01 '22 12:10

Bob Swain

Tags:

window-functions

apache-spark

pyspark

moving-average
