Spark add new column to dataframe with value from previous row

I'm wondering how I can achieve the following in Spark (PySpark).

Initial Dataframe:

+--+---+
|id|num|
+--+---+
|4 |9.0|
|3 |7.0|
|2 |3.0|
|1 |5.0|
+--+---+

Resulting Dataframe:

+--+---+-------+
|id|num|new_Col|
+--+---+-------+
|4 |9.0|  7.0  |
|3 |7.0|  3.0  |
|2 |3.0|  5.0  |
+--+---+-------+

I can generally append new columns to a dataframe by using something like df.withColumn("new_Col", df.num * 10).

However, I have no idea how to achieve this "shift of rows" for the new column, so that the new column has the value of a field from the previous row (as shown in the example). I also couldn't find anything in the API documentation on how to access a certain row in a DataFrame by index.

Any help would be appreciated.

Asked Dec 15 '15 by Kito

1 Answer

You can use the lag window function as follows:

from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

df = sc.parallelize([(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)]).toDF(["id", "num"])

# An empty partitionBy() puts all rows into a single window partition;
# orderBy defines the order in which lag looks one row back.
w = Window().partitionBy().orderBy(col("id"))

# lag("num") is NULL for the first row in the ordering, so na.drop() removes it.
df.select("*", lag("num").over(w).alias("new_col")).na.drop().show()

## +---+---+-------+
## | id|num|new_col|
## +---+---+-------+
## |  2|3.0|    5.0|
## |  3|7.0|    3.0|
## |  4|9.0|    7.0|
## +---+---+-------+
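
The snippet above predates Spark 2.x; on newer versions the same frame is usually built through a SparkSession rather than sc.parallelize. A minimal equivalent, assuming a standard session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)], ["id", "num"])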

The lag approach works, but there are some important issues:

  1. If you need a global operation (not partitioned by some other column or columns), it is extremely inefficient: an empty partitionBy() forces all rows into a single partition (see the sketch below).
  2. You need a natural way to order your data.
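
For contrast, a minimal sketch of the partitioned case. The grouping column "category" and the frame df_with_cat are hypothetical, purely for illustration; with a real partitioning column, Spark computes each window partition independently and in parallel:

from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

# "category" is a hypothetical grouping column; each category's window
# is processed on its own, so no single partition has to hold all rows.
w_part = Window.partitionBy("category").orderBy(col("id"))
df_with_cat.select("*", lag("num").over(w_part).alias("new_col")).show()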

While the second issue is almost never a problem, the first one can be a deal-breaker. If that is the case, you should simply convert your DataFrame to an RDD and compute lag manually (a sketch follows the links below). See for example:

  • How to transform data with sliding window over time series data in Pyspark
  • Apache Spark Moving Average (written in Scala, but can be adapted for PySpark; be sure to read the comments first).
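
A minimal sketch of that manual approach, assuming the same df as above and that id gives a total order. zipWithIndex assigns each row its position in id order, and joining on shifted positions pairs every row with its predecessor; this still shuffles for the sort and the join, but it does not funnel every row through a single partition the way an unpartitioned window does:

# Index each row by its position in id order: (Row, position)
indexed = df.orderBy("id").rdd.zipWithIndex()

current = indexed.map(lambda x: (x[1], x[0]))           # (pos, row)
previous = indexed.map(lambda x: (x[1] + 1, x[0].num))  # (pos + 1, previous num)

# The inner join drops the first row, which has no predecessor.
lagged = (current
          .join(previous)
          .map(lambda x: (x[1][0].id, x[1][0].num, x[1][1])))

lagged.toDF(["id", "num", "new_col"]).show()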

Other useful links:

  • https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/05_moving-average-imputation.ipynb
  • Spark Window Functions - rangeBetween dates
Answered by zero323