I have tried the code as in (this_post) and cannot get the date difference in seconds. I just take the datediff() between the columns 'Attributes_Timestamp_fix' and 'lagged_date' below. Any hints? Below are my code and output.
from pyspark.sql.functions import datediff, lag
from pyspark.sql.window import Window

eg = eg.withColumn("lagged_date",
                   lag(eg.Attributes_Timestamp_fix, 1)
                   .over(Window.partitionBy("id")
                         .orderBy("Attributes_Timestamp_fix")))
eg = eg.withColumn("time_diff",
                   datediff(eg.Attributes_Timestamp_fix, eg.lagged_date))
             id Attributes_Timestamp_fix  time_diff
0  3.531611e+14      2018-04-01 00:01:02        NaN
1  3.531611e+14      2018-04-01 00:01:02        0.0
2  3.531611e+14      2018-04-01 00:03:13        0.0
3  3.531611e+14      2018-04-01 00:03:13        0.0
4  3.531611e+14      2018-04-01 00:03:13        0.0
5  3.531611e+14      2018-04-01 00:03:13        0.0
Getting the difference between two dates in days, months, quarters, or years in PySpark can be accomplished with the datediff() and months_between() functions. datediff() calculates the difference between two dates in days.
Discussion: If you'd like the difference between two timestamps in seconds, multiply the decimal difference in days by the number of seconds in a day, 24 * 60 * 60 = 86400 (the number of hours in a day times the number of minutes in an hour times the number of seconds in a minute).
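To sanity-check that 86400 factor in plain Python (a standalone sketch using the timestamps from the question's output, no Spark involved):

```python
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60  # 86400

t1 = datetime(2018, 4, 1, 0, 1, 2)
t2 = datetime(2018, 4, 1, 0, 3, 13)

# Decimal difference in days, then scaled back to seconds.
diff_days = (t2 - t1).total_seconds() / SECONDS_PER_DAY
print(diff_days * SECONDS_PER_DAY)  # 131.0  (2 min 11 s)
```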
Use the strptime(date_str, format) function to convert a date string into a datetime object according to the corresponding format. To get the difference between two dates, subtract date2 from date1; the result is a timedelta object.
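A minimal plain-Python sketch of that approach (the format string is an assumption matching the timestamps shown above):

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"
date1 = datetime.strptime("2018-04-01 00:03:13", fmt)
date2 = datetime.strptime("2018-04-01 00:01:02", fmt)

delta = date1 - date2          # a timedelta object
print(delta.total_seconds())   # 131.0
```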
In pyspark.sql.functions there is a function datediff that unfortunately only computes differences in days. To overcome this, you can convert both dates to unix timestamps (in seconds) and compute the difference.
Let's create some sample data, compute the lag and then the difference in seconds.
from pyspark.sql.functions import col, lag, unix_timestamp
from pyspark.sql.window import Window
import datetime
d = [{'id': 1, 't': datetime.datetime(2018, 1, 1)},
     {'id': 1, 't': datetime.datetime(2018, 1, 2)},
     {'id': 1, 't': datetime.datetime(2018, 1, 4)},
     {'id': 1, 't': datetime.datetime(2018, 1, 7)}]
df = spark.createDataFrame(d)
df.show()
+---+-------------------+
| id| t|
+---+-------------------+
| 1|2018-01-01 00:00:00|
| 1|2018-01-02 00:00:00|
| 1|2018-01-04 00:00:00|
| 1|2018-01-07 00:00:00|
+---+-------------------+
w = Window.partitionBy('id').orderBy('t')
df.withColumn("previous_t", lag(df.t, 1).over(w))\
.select(df.t, (unix_timestamp(df.t) - unix_timestamp(col('previous_t'))).alias('diff'))\
.show()
+-------------------+------+
| t| diff|
+-------------------+------+
|2018-01-01 00:00:00| null|
|2018-01-02 00:00:00| 86400|
|2018-01-04 00:00:00|172800|
|2018-01-07 00:00:00|259200|
+-------------------+------+