
How to get datediff() in seconds in pyspark?

I have tried the code as in this_post, but I cannot get the date difference in seconds. I just take the datediff() between the columns 'Attributes_Timestamp_fix' and 'lagged_date' below. Any hints? My code and output are below.

from pyspark.sql.functions import datediff, lag
from pyspark.sql.window import Window

# Previous timestamp within each id, ordered by timestamp
eg = eg.withColumn("lagged_date", lag(eg.Attributes_Timestamp_fix, 1)
                   .over(Window.partitionBy("id")
                         .orderBy("Attributes_Timestamp_fix")))

# datediff() between current and previous timestamp
eg = eg.withColumn("time_diff",
                   datediff(eg.Attributes_Timestamp_fix, eg.lagged_date))

        id      Attributes_Timestamp_fix time_diff
0   3.531611e+14    2018-04-01 00:01:02 NaN
1   3.531611e+14    2018-04-01 00:01:02 0.0
2   3.531611e+14    2018-04-01 00:03:13 0.0
3   3.531611e+14    2018-04-01 00:03:13 0.0
4   3.531611e+14    2018-04-01 00:03:13 0.0
5   3.531611e+14    2018-04-01 00:03:13 0.0
asked Mar 08 '19 by a_geo

People also ask

How do you do datediff in Pyspark?

Getting the difference between two dates in days, years, months, or quarters in pyspark can be accomplished with the datediff() and months_between() functions. datediff() calculates the difference between two dates in days.
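
For illustration, a minimal self-contained sketch of both functions (the start/end column names and sample dates are made up for this example):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, datediff, months_between

spark = SparkSession.builder.getOrCreate()

# Two date columns, three months apart
df = spark.createDataFrame([("2018-01-01", "2018-04-01")], ["start", "end"])\
    .select(col("start").cast("date"), col("end").cast("date"))

df.select(datediff("end", "start").alias("days"),
          months_between("end", "start").alias("months")).show()

+----+------+
|days|months|
+----+------+
|  90|   3.0|
+----+------+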

How do you find the difference between two timestamps?

If you'd like to calculate the difference between two timestamps in seconds, multiply the decimal difference in days by the number of seconds in a day: 24 * 60 * 60 = 86400, the product of the number of hours in a day, the number of minutes in an hour, and the number of seconds in a minute.
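
For example, in plain Python with the two timestamps from the question's output (0:01:02 and 0:03:13, which are 131 seconds apart):

import datetime

t1 = datetime.datetime(2018, 4, 1, 0, 1, 2)
t2 = datetime.datetime(2018, 4, 1, 0, 3, 13)

days = (t2 - t1) / datetime.timedelta(days=1)  # decimal difference in days
print(days * 86400)  # 131.0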

How do you use the datediff function in Python?

Use the strptime(date_str, format) function to convert a date string into a datetime object according to the corresponding format. To get the difference between two dates, subtract date2 from date1; the result is a timedelta object.
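
A short sketch (the date strings and format are illustrative):

from datetime import datetime

date1 = datetime.strptime("2018-04-01 00:03:13", "%Y-%m-%d %H:%M:%S")
date2 = datetime.strptime("2018-04-01 00:01:02", "%Y-%m-%d %H:%M:%S")

delta = date1 - date2           # a timedelta object
print(delta.total_seconds())    # 131.0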


1 Answer

pyspark.sql.functions provides a datediff function, but unfortunately it only computes differences in days. To work around this, you can convert both dates to unix timestamps (in seconds) and compute the difference.

Let's create some sample data, compute the lag and then the difference in seconds.

from pyspark.sql.functions import col, lag, unix_timestamp
from pyspark.sql.window import Window
import datetime

# Sample data: one id with four timestamps, a few days apart
d = [{'id': 1, 't': datetime.datetime(2018, 1, 1)},
     {'id': 1, 't': datetime.datetime(2018, 1, 2)},
     {'id': 1, 't': datetime.datetime(2018, 1, 4)},
     {'id': 1, 't': datetime.datetime(2018, 1, 7)}]

df = spark.createDataFrame(d)
df.show()
+---+-------------------+
| id|                  t|
+---+-------------------+
|  1|2018-01-01 00:00:00|
|  1|2018-01-02 00:00:00|
|  1|2018-01-04 00:00:00|
|  1|2018-01-07 00:00:00|
+---+-------------------+

# For each row, fetch the previous timestamp within the same id
w = Window.partitionBy('id').orderBy('t')
df.withColumn("previous_t", lag(df.t, 1).over(w))\
  .select(df.t, (unix_timestamp(df.t) - unix_timestamp(col('previous_t'))).alias('diff'))\
  .show()

+-------------------+------+
|                  t|  diff|
+-------------------+------+
|2018-01-01 00:00:00|  null|
|2018-01-02 00:00:00| 86400|
|2018-01-04 00:00:00|172800|
|2018-01-07 00:00:00|259200|
+-------------------+------+
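
As an aside (my addition, not part of the original answer): a timestamp column can also be cast directly to long, which yields epoch seconds in Spark, so the same difference can be written without unix_timestamp:

df.withColumn("previous_t", lag(df.t, 1).over(w))\
  .select(df.t, (df.t.cast('long') - col('previous_t').cast('long')).alias('diff'))\
  .show()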
answered by Oli