I am trying to solve a problem with PySpark.
I have a dataset such as:
Condition | Date
0         | 2019/01/10
1         | 2019/01/11
0         | 2019/01/15
1         | 2019/01/16
1         | 2019/01/19
0         | 2019/01/23
0         | 2019/01/25
1         | 2019/01/29
1         | 2019/01/30
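For reproducibility, the sample can be built like this (a small sketch that assumes an active SparkSession named spark, with dates kept as strings):
# Sample data as listed above; Date is a string in yyyy/MM/dd format
data = [(0, "2019/01/10"), (1, "2019/01/11"), (0, "2019/01/15"),
        (1, "2019/01/16"), (1, "2019/01/19"), (0, "2019/01/23"),
        (0, "2019/01/25"), (1, "2019/01/29"), (1, "2019/01/30")]
df = spark.createDataFrame(data, ["Condition", "Date"])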
For each row, I would like to get the most recent previous value of the Date column from a row where Condition == 1.
The desired output would be something like:
Condition | Date       | Lag
0         | 2019/01/10 | NaN
1         | 2019/01/11 | NaN
0         | 2019/01/15 | 2019/01/11
1         | 2019/01/16 | 2019/01/11
1         | 2019/01/19 | 2019/01/16
0         | 2019/01/23 | 2019/01/19
0         | 2019/01/25 | 2019/01/19
1         | 2019/01/29 | 2019/01/19
1         | 2019/01/30 | 2019/01/29
How can I do that?
Please keep in mind it's a very large dataset, which I will have to partition and group by a UUID, so the solution needs to be reasonably performant.
Thank you,
Here is a solution with PySpark. The logic is the same as @GordonLinoff's SQL solution below.
from pyspark.sql import Window
from pyspark.sql.functions import col, lit, max, when
w = Window.orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)
df.withColumn("Lag", max(when(col("Condition") == lit(1), col("Date"))).over(w)).show()
Gives:
+---------+----------+----------+
|Condition| Date| Lag|
+---------+----------+----------+
| 0|2019/01/10| null|
| 1|2019/01/11| null|
| 0|2019/01/15|2019/01/11|
| 1|2019/01/16|2019/01/11|
| 1|2019/01/19|2019/01/16|
| 0|2019/01/23|2019/01/19|
| 0|2019/01/25|2019/01/19|
| 1|2019/01/29|2019/01/19|
| 1|2019/01/30|2019/01/29|
+---------+----------+----------+
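Since the question mentions a very large dataset split per UUID, the same window can be partitioned so each UUID is processed independently (and Spark does not have to pull all rows into a single partition for the global ordering). This is only a sketch; the "uuid" column name is an assumption:
# Sketch: same logic applied per UUID group (assumes the DataFrame has a "uuid" column)
w = Window.partitionBy("uuid").orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)
df.withColumn("Lag", max(when(col("Condition") == lit(1), col("Date"))).over(w))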
In SQL, you can use a conditional running max():
select t.*,
       max(case when condition = 1 then date end) over (
           order by date
           rows between unbounded preceding and 1 preceding
       ) as prev_condition_1_date
from t;
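If you prefer this SQL form from PySpark, one possible sketch (assuming an active SparkSession named spark) is to register the DataFrame as a temporary view and run the same query:
# Sketch: execute the SQL answer through Spark SQL (SparkSession `spark` is assumed)
df.createOrReplaceTempView("t")
spark.sql("""
    select t.*,
           max(case when condition = 1 then date end) over (
               order by date
               rows between unbounded preceding and 1 preceding
           ) as prev_condition_1_date
    from t
""").show()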