Is it possible to create a Window function that can have multiple conditions in orderBy for rangeBetween or rowsBetween? Assume I have a data frame like the one below.
user_id timestamp date event
0040b5f0 2018-01-22 13:04:32 2018-01-22 1
0040b5f0 2018-01-22 13:04:35 2018-01-22 0
0040b5f0 2018-01-25 18:55:08 2018-01-25 1
0040b5f0 2018-01-25 18:56:17 2018-01-25 1
0040b5f0 2018-01-25 20:51:43 2018-01-25 1
0040b5f0 2018-01-31 07:48:43 2018-01-31 1
0040b5f0 2018-01-31 07:48:48 2018-01-31 0
0040b5f0 2018-02-02 09:40:58 2018-02-02 1
0040b5f0 2018-02-02 09:41:01 2018-02-02 0
0040b5f0 2018-02-05 14:03:27 2018-02-05 1
For each row, I need the sum of the event column values whose date is no more than 3 days earlier, but I must not include events that happen later on the same date. I can create a window function like:
from pyspark.sql import Window
import pyspark.sql.functions as F

days = lambda i: i * 86400
my_window = Window\
    .partitionBy(["user_id"])\
    .orderBy(F.col("date").cast("timestamp").cast("long"))\
    .rangeBetween(-days(3), 0)
But this will include events that happened later on the same date. I need to create a window function that will act like this (for the row marked with *):
user_id timestamp date event
0040b5f0 2018-01-22 13:04:32 2018-01-22 1----|==============|
0040b5f0 2018-01-22 13:04:35 2018-01-22 0 sum here all events
0040b5f0 2018-01-25 18:55:08 2018-01-25 1 only within 3 days
* 0040b5f0 2018-01-25 18:56:17 2018-01-25 1----| |
0040b5f0 2018-01-25 20:51:43 2018-01-25 1===================|
0040b5f0 2018-01-31 07:48:43 2018-01-31 1
0040b5f0 2018-01-31 07:48:48 2018-01-31 0
0040b5f0 2018-02-02 09:40:58 2018-02-02 1
0040b5f0 2018-02-02 09:41:01 2018-02-02 0
0040b5f0 2018-02-05 14:03:27 2018-02-05 1
I tried to create something like:
days = lambda i: i * 86400
my_window = Window\
    .partitionBy(["user_id"])\
    .orderBy(F.col("date").cast("timestamp").cast("long"))\
    .rangeBetween(-days(3), Window.currentRow)\
    .orderBy(F.col("t_stamp"))\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
But it only reflects the last orderBy, since each orderBy/rangeBetween call replaces the previously defined ordering and frame.
The result table should look like:
user_id timestamp date event event_last_3d
0040b5f0 2018-01-22 13:04:32 2018-01-22 1 1
0040b5f0 2018-01-22 13:04:35 2018-01-22 0 1
0040b5f0 2018-01-25 18:55:08 2018-01-25 1 2
0040b5f0 2018-01-25 18:56:17 2018-01-25 1 3
0040b5f0 2018-01-25 20:51:43 2018-01-25 1 4
0040b5f0 2018-01-31 07:48:43 2018-01-31 1 1
0040b5f0 2018-01-31 07:48:48 2018-01-31 0 1
0040b5f0 2018-02-02 09:40:58 2018-02-02 1 2
0040b5f0 2018-02-02 09:41:01 2018-02-02 0 2
0040b5f0 2018-02-05 14:03:27 2018-02-05 1 2
I've been stuck on this one for some time; I would appreciate any advice on how to approach it.
I have written the equivalent in Scala that achieves your requirement; it shouldn't be difficult to convert to Python:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val DAY_SECS = 24 * 60 * 60 // seconds in a day
// Given a timestamp in seconds, returns the seconds value of 00:00:00 of that date
val trimToDateBoundary = (d: Long) => (d / DAY_SECS) * DAY_SECS

// Using 4 days for the range: your requirement is to cover the 3 previous days,
// which, counted inclusively by date, spans 4 days.
// E.g. for any timestamp on 25 Jan, the range covers
// (25 Jan 00:00:00 - 4 * DAY_SECS = 22 Jan 00:00:00) up to the current timestamp.
val wSpec = Window
  .partitionBy("user_id")
  .orderBy(col("timestamp").cast("long"))
  .rangeBetween(trimToDateBoundary(Window.currentRow) - (4 * DAY_SECS), Window.currentRow)

df.withColumn("sum", sum(col("event")).over(wSpec)).show()
Below is the output when this is applied to your data:
+--------+--------------------+--------------------+-----+---+
| user_id| timestamp| date|event|sum|
+--------+--------------------+--------------------+-----+---+
|0040b5f0|2018-01-22 13:04:...|2018-01-22 00:00:...| 1.0|1.0|
|0040b5f0|2018-01-22 13:04:...|2018-01-22 00:00:...| 0.0|1.0|
|0040b5f0|2018-01-25 18:55:...|2018-01-25 00:00:...| 1.0|2.0|
|0040b5f0|2018-01-25 18:56:...|2018-01-25 00:00:...| 1.0|3.0|
|0040b5f0|2018-01-25 20:51:...|2018-01-25 00:00:...| 1.0|4.0|
|0040b5f0|2018-01-31 07:48:...|2018-01-31 00:00:...| 1.0|1.0|
|0040b5f0|2018-01-31 07:48:...|2018-01-31 00:00:...| 0.0|1.0|
|0040b5f0|2018-02-02 09:40:...|2018-02-02 00:00:...| 1.0|2.0|
|0040b5f0|2018-02-02 09:41:...|2018-02-02 00:00:...| 0.0|2.0|
|0040b5f0|2018-02-05 14:03:...|2018-02-05 00:00:...| 1.0|2.0|
+--------+--------------------+--------------------+-----+---+
I have not used the "date" column, and I'm not sure how to achieve your requirement with it taken into account. So, if there is a possibility of the timestamp's date being different from the date column, this solution doesn't cover that case.
Note: an overload of rangeBetween that accepts Column arguments (including date/timestamp type columns) was introduced in Spark 2.3.0, so a solution based on it may be more elegant.