I am trying to solve a problem with PySpark.
I have a dataset such as:
Condition | Date
0         | 2019/01/10
1         | 2019/01/11
0         | 2019/01/15
1         | 2019/01/16
1         | 2019/01/19
0         | 2019/01/23
0         | 2019/01/25
1         | 2019/01/29
1         | 2019/01/30
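For reproducibility, the sample can be built like this (a small sketch that assumes an active SparkSession named spark, with dates kept as strings):
# Sample data as listed above; Date is a string in yyyy/MM/dd format
data = [(0, "2019/01/10"), (1, "2019/01/11"), (0, "2019/01/15"),
        (1, "2019/01/16"), (1, "2019/01/19"), (0, "2019/01/23"),
        (0, "2019/01/25"), (1, "2019/01/29"), (1, "2019/01/30")]
df = spark.createDataFrame(data, ["Condition", "Date"])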
For each row, I would like to get the most recent previous value of the Date column from a row where Condition == 1.
The desired output would be something like:
Condition | Date       | Lag
0         | 2019/01/10 | NaN
1         | 2019/01/11 | NaN
0         | 2019/01/15 | 2019/01/11
1         | 2019/01/16 | 2019/01/11
1         | 2019/01/19 | 2019/01/16
0         | 2019/01/23 | 2019/01/19
0         | 2019/01/25 | 2019/01/19
1         | 2019/01/29 | 2019/01/19
1         | 2019/01/30 | 2019/01/29
How can I do that?
Please keep in mind it's a very large dataset, which I will have to partition and group by a UUID, so the solution needs to be reasonably performant.
Thank you,
Here is a solution with PySpark. The logic is the same as @GordonLinoff's SQL solution below.
from pyspark.sql import Window
from pyspark.sql.functions import col, lit, max, when
w = Window.orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)
df.withColumn("Lag", max(when(col("Condition") == lit(1), col("Date"))).over(w)).show()
Gives:
+---------+----------+----------+
|Condition| Date| Lag|
+---------+----------+----------+
| 0|2019/01/10| null|
| 1|2019/01/11| null|
| 0|2019/01/15|2019/01/11|
| 1|2019/01/16|2019/01/11|
| 1|2019/01/19|2019/01/16|
| 0|2019/01/23|2019/01/19|
| 0|2019/01/25|2019/01/19|
| 1|2019/01/29|2019/01/19|
| 1|2019/01/30|2019/01/29|
+---------+----------+----------+
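Since the question mentions a very large dataset split per UUID, the same window can be partitioned so each UUID is processed independently (and Spark does not have to pull all rows into a single partition for the global ordering). This is only a sketch; the "uuid" column name is an assumption:
# Sketch: same logic applied per UUID group (assumes the DataFrame has a "uuid" column)
w = Window.partitionBy("uuid").orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)
df.withColumn("Lag", max(when(col("Condition") == lit(1), col("Date"))).over(w))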
In SQL, you can use a conditional running max():
select t.*,
       max(case when condition = 1 then date end) over (
           order by date
           rows between unbounded preceding and 1 preceding
       ) as prev_condition_1_date
from t;
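If you prefer this SQL form from PySpark, one possible sketch (assuming an active SparkSession named spark) is to register the DataFrame as a temporary view and run the same query:
# Sketch: execute the SQL answer through Spark SQL (SparkSession `spark` is assumed)
df.createOrReplaceTempView("t")
spark.sql("""
    select t.*,
           max(case when condition = 1 then date end) over (
               order by date
               rows between unbounded preceding and 1 preceding
           ) as prev_condition_1_date
    from t
""").show()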