
Applying a Window function to calculate differences in pySpark

I am using pySpark, and have set up my dataframe with two columns representing a daily asset price as follows:

ind = sc.parallelize(range(1, 5))
prices = sc.parallelize([33.3, 31.1, 51.2, 21.3])
data = ind.zip(prices)
df = sqlCtx.createDataFrame(data, ["day", "price"])

Upon applying df.show(), I get:

+---+-----+
|day|price|
+---+-----+
|  1| 33.3|
|  2| 31.1|
|  3| 51.2|
|  4| 21.3|
+---+-----+

Which is fine and all. I would like to have another column that contains the day-to-day returns of the price column, i.e., something like

(price(day2)-price(day1))/(price(day1))

After much research, I am told that this is most efficiently accomplished by applying the pyspark.sql.window functions, but I am unable to see how.

asked Apr 19 '16 by Thomas Moore

People also ask

What does window function do in Pyspark?

PySpark window functions perform statistical operations such as rank, row number, etc. over a group, frame, or collection of rows, and return a result for each row individually. They are also increasingly popular for data transformations.
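As a minimal sketch of the idea (assuming a SparkSession named spark and a made-up toy DataFrame, neither of which comes from the question):

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as func

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data: two groups with two values each
df_toy = spark.createDataFrame(
    [("a", 10), ("a", 30), ("b", 20), ("b", 40)], ["grp", "val"])

# Rank rows within each group, highest value first; every row gets a result
w = Window.partitionBy("grp").orderBy(func.desc("val"))
df_toy.withColumn("rank", func.rank().over(w)) \
      .withColumn("row_number", func.row_number().over(w)) \
      .show()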

How is Pyspark time difference calculated?

Timestamp difference in PySpark can be calculated by using 1) unix_timestamp() to get the time in seconds and subtracting one time from the other to get the difference in seconds, or 2) casting the TimestampType column to LongType and subtracting the two long values to get the difference in seconds; divide it by 60 to get the minute difference and by 3600 to get the hour difference.
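A minimal sketch of both approaches (assuming a SparkSession named spark and made-up timestamps, none of which come from the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data in the default 'yyyy-MM-dd HH:mm:ss' format
df_ts = spark.createDataFrame(
    [("2016-04-19 17:00:00", "2016-04-19 18:30:00")], ["start", "end"])

# 1) unix_timestamp() converts to seconds since the epoch, then subtract
diff_sec = func.unix_timestamp("end") - func.unix_timestamp("start")

# 2) cast to TimestampType, then to LongType (seconds), then subtract
diff_sec_cast = (func.col("end").cast("timestamp").cast("long")
                 - func.col("start").cast("timestamp").cast("long"))

df_ts.withColumn("diff_seconds", diff_sec) \
     .withColumn("diff_minutes", diff_sec / 60) \
     .withColumn("diff_hours", diff_sec_cast / 3600) \
     .show()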

How do you use lead function in Pyspark?

Window function: returns the value that is offset rows after the current row, and default if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition. This is equivalent to the LEAD function in SQL.
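Against the question's df, a minimal sketch (the next_day_price column name is made up):

from pyspark.sql.window import Window
import pyspark.sql.functions as func

# Next day's price for every row; the last row gets null
w = Window.orderBy("day")
df.withColumn("next_day_price", func.lead(df["price"], 1).over(w)).show()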


2 Answers

You can bring in the previous day's price with the lag function, and then add a column that computes the actual day-to-day return from the two columns. Note that you may have to tell Spark how to partition your data and/or order it in order to apply lag, something like this:

from pyspark.sql.window import Window
import pyspark.sql.functions as func
from pyspark.sql.functions import lit

dfu = df.withColumn('user', lit('tmoore'))

df_lag = dfu.withColumn('prev_day_price',
                        func.lag(dfu['price'])
                            .over(Window.partitionBy("user")))

result = df_lag.withColumn('daily_return',
          (df_lag['price'] - df_lag['prev_day_price']) / df_lag['price'])

>>> result.show()
+---+-----+-------+--------------+--------------------+
|day|price|   user|prev_day_price|        daily_return|
+---+-----+-------+--------------+--------------------+
|  1| 33.3| tmoore|          null|                null|
|  2| 31.1| tmoore|          33.3|-0.07073954983922816|
|  3| 51.2| tmoore|          31.1|         0.392578125|
|  4| 21.3| tmoore|          51.2|  -1.403755868544601|
+---+-----+-------+--------------+--------------------+
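One caveat worth noting: as written, this divides by the current day's price, i.e. (price - prev_day_price) / price. To get the return exactly as defined in the question, divide by the previous day's price instead:

result = df_lag.withColumn('daily_return',
          (df_lag['price'] - df_lag['prev_day_price']) / df_lag['prev_day_price'])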

Here is a longer introduction to Window functions in Spark.

answered Oct 23 '22 by Oleksiy

The lag function can help you solve this use case.

from pyspark.sql.window import Window
import pyspark.sql.functions as func

### Defining the window
Windowspec = Window.orderBy("day")

### Calculating lag of price at each day level
prev_day_price = df.withColumn('prev_day_price',
                        func.lag(df['price'])
                            .over(Windowspec))

### Calculating the daily return
result = prev_day_price.withColumn('daily_return',
          (prev_day_price['price'] - prev_day_price['prev_day_price']) / prev_day_price['price'])
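Note that because this window specifies an orderBy without a partitionBy, Spark will move all of the data into a single partition to compute lag (and logs a performance warning to that effect). That is harmless for a toy DataFrame like this one, but for large data you would want to partition the window, as in the first answer.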
answered Oct 23 '22 by Sushmita Konar