Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pyspark function.lag on condition

I am trying to solve a problem with pyspark,

I have a dataset such as:

Condition | Date
0 | 2019/01/10
1 | 2019/01/11
0 | 2019/01/15
1 | 2019/01/16
1 | 2019/01/19
0 | 2019/01/23
0 | 2019/01/25
1 | 2019/01/29
1 | 2019/01/30

I would like to get the latest lag value of the date column when condition == 1 was met

The desired output would be something like:

Condition | Date | Lag
0 | 2019/01/10 | NaN
1 | 2019/01/11 | NaN
0 | 2019/01/15 | 2019/01/11
1 | 2019/01/16 | 2019/01/11
1 | 2019/01/19 | 2019/01/16
0 | 2019/01/23 | 2019/01/19
0 | 2019/01/25 | 2019/01/19
1 | 2019/01/29 | 2019/01/19
1 | 2019/01/30 | 2019/01/29

How can I perform that?

Please do keep in mind its a very large dataset - which I will have to partition and group by an UUID so the solution has to be somewhat performatic.

Thank you,

like image 536
epattaro Avatar asked Dec 22 '25 02:12

epattaro


2 Answers

Here is a solution with Pyspark. The logic remains the same as @GordonLinoff's solution with SQL query.

w = Window.orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)
df.withColumn("Lag", max(when(col("Condition") == lit(1), col("Date"))).over(w)).show()

Gives:

+---------+----------+----------+
|Condition|      Date|       Lag|
+---------+----------+----------+
|        0|2019/01/10|      null|
|        1|2019/01/11|      null|
|        0|2019/01/15|2019/01/11|
|        1|2019/01/16|2019/01/11|
|        1|2019/01/19|2019/01/16|
|        0|2019/01/23|2019/01/19|
|        0|2019/01/25|2019/01/19|
|        1|2019/01/29|2019/01/19|
|        1|2019/01/30|2019/01/29|
+---------+----------+----------+
like image 111
blackbishop Avatar answered Dec 24 '25 02:12

blackbishop


In SQL, you can use a conditional running max():

select t.*,
       max(case when condition = 1 then date end) over (order by date
                                                        rows between unbounded preceding and 1 preceding
                                                       ) as prev_condition_1_date
from t;
like image 28
Gordon Linoff Avatar answered Dec 24 '25 04:12

Gordon Linoff



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!