
How to get datediff() in seconds in pyspark?

I have tried the code as in this_post, but I cannot get the date difference in seconds. I just take the datediff() between the columns 'Attributes_Timestamp_fix' and 'lagged_date' below. Any hints? My code and output are below.

from pyspark.sql.functions import datediff, lag
from pyspark.sql.window import Window

# Previous timestamp within each id, ordered by timestamp
eg = eg.withColumn("lagged_date", lag(eg.Attributes_Timestamp_fix, 1)
                   .over(Window.partitionBy("id")
                         .orderBy("Attributes_Timestamp_fix")))

# datediff() between current and previous timestamp
eg = eg.withColumn("time_diff",
                   datediff(eg.Attributes_Timestamp_fix, eg.lagged_date))

        id      Attributes_Timestamp_fix time_diff
0   3.531611e+14    2018-04-01 00:01:02 NaN
1   3.531611e+14    2018-04-01 00:01:02 0.0
2   3.531611e+14    2018-04-01 00:03:13 0.0
3   3.531611e+14    2018-04-01 00:03:13 0.0
4   3.531611e+14    2018-04-01 00:03:13 0.0
5   3.531611e+14    2018-04-01 00:03:13 0.0
asked Mar 08 '19 by a_geo

People also ask

How do you do datediff in Pyspark?

Getting the difference between two dates in days, years, months, or quarters in pyspark can be accomplished with the datediff() and months_between() functions. datediff() calculates the difference between two dates in days.
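
For illustration, a minimal self-contained sketch of both functions (the start/end column names and sample dates are made up for this example):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, datediff, months_between

spark = SparkSession.builder.getOrCreate()

# Two date columns, three months apart
df = spark.createDataFrame([("2018-01-01", "2018-04-01")], ["start", "end"])\
    .select(col("start").cast("date"), col("end").cast("date"))

df.select(datediff("end", "start").alias("days"),
          months_between("end", "start").alias("months")).show()

+----+------+
|days|months|
+----+------+
|  90|   3.0|
+----+------+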

How do you find the difference between two timestamps?

If you'd like to calculate the difference between two timestamps in seconds, multiply the decimal difference in days by the number of seconds in a day: 24 * 60 * 60 = 86400, the product of the number of hours in a day, the number of minutes in an hour, and the number of seconds in a minute.
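
For example, in plain Python with the two timestamps from the question's output (0:01:02 and 0:03:13, which are 131 seconds apart):

import datetime

t1 = datetime.datetime(2018, 4, 1, 0, 1, 2)
t2 = datetime.datetime(2018, 4, 1, 0, 3, 13)

days = (t2 - t1) / datetime.timedelta(days=1)  # decimal difference in days
print(days * 86400)  # 131.0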

How do you use the datediff function in Python?

Use the strptime(date_str, format) function to convert a date string into a datetime object according to the corresponding format. To get the difference between two dates, subtract date2 from date1; the result is a timedelta object.
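
A short sketch (the date strings and format are illustrative):

from datetime import datetime

date1 = datetime.strptime("2018-04-01 00:03:13", "%Y-%m-%d %H:%M:%S")
date2 = datetime.strptime("2018-04-01 00:01:02", "%Y-%m-%d %H:%M:%S")

delta = date1 - date2           # a timedelta object
print(delta.total_seconds())    # 131.0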


1 Answer

pyspark.sql.functions provides a datediff function, but unfortunately it only computes differences in days. To work around this, you can convert both dates to unix timestamps (in seconds) and compute the difference.

Let's create some sample data, compute the lag and then the difference in seconds.

from pyspark.sql.functions import col, lag, unix_timestamp
from pyspark.sql.window import Window
import datetime

# Sample data: one id with four timestamps, a few days apart
d = [{'id': 1, 't': datetime.datetime(2018, 1, 1)},
     {'id': 1, 't': datetime.datetime(2018, 1, 2)},
     {'id': 1, 't': datetime.datetime(2018, 1, 4)},
     {'id': 1, 't': datetime.datetime(2018, 1, 7)}]

df = spark.createDataFrame(d)
df.show()
+---+-------------------+
| id|                  t|
+---+-------------------+
|  1|2018-01-01 00:00:00|
|  1|2018-01-02 00:00:00|
|  1|2018-01-04 00:00:00|
|  1|2018-01-07 00:00:00|
+---+-------------------+

# For each row, fetch the previous timestamp within the same id
w = Window.partitionBy('id').orderBy('t')
df.withColumn("previous_t", lag(df.t, 1).over(w))\
  .select(df.t, (unix_timestamp(df.t) - unix_timestamp(col('previous_t'))).alias('diff'))\
  .show()

+-------------------+------+
|                  t|  diff|
+-------------------+------+
|2018-01-01 00:00:00|  null|
|2018-01-02 00:00:00| 86400|
|2018-01-04 00:00:00|172800|
|2018-01-07 00:00:00|259200|
+-------------------+------+
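
As an aside (my addition, not part of the original answer): a timestamp column can also be cast directly to long, which yields epoch seconds in Spark, so the same difference can be written without unix_timestamp:

df.withColumn("previous_t", lag(df.t, 1).over(w))\
  .select(df.t, (df.t.cast('long') - col('previous_t').cast('long')).alias('diff'))\
  .show()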
answered by Oli