PySpark: Subtract Two Timestamp Columns and Give Back Difference in Minutes (Using F.datediff gives back only whole days)

I have the following sample dataframe. The date_1 and date_2 columns have datatype of timestamp.

ID  date_1                      date_2                      date_diff
A   2019-01-09T01:25:00.000Z    2019-01-10T14:00:00.000Z    -1
B   2019-01-12T02:18:00.000Z    2019-01-12T17:00:00.000Z    0

I want to find the difference between date_1 and date_2 in minutes.

When I use the code below, it gives me date_diff as a whole number of days:

import pyspark.sql.functions as F

df = df.withColumn("date_diff", F.datediff(F.col("date_1"), F.col("date_2")))

But what I want is for date_diff to take into consideration the timestamp and give me minutes back.

How do I do this?

PineNuts0 asked Jan 28 '19

People also ask

How do you calculate the difference between two timestamps in PySpark?

Timestamp difference in PySpark can be calculated in two ways: (1) use unix_timestamp() to convert each timestamp to seconds since the epoch, then subtract one value from the other to get the difference in seconds; or (2) cast each TimestampType column to LongType and subtract the two long values to get the difference in seconds. Either way, divide the result by 60 to get the difference in minutes.
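The epoch-seconds arithmetic described above can be sketched in plain Python (no Spark session required); `minutes_between` is a hypothetical helper name, and the sample timestamps are taken from row B of the question:

```python
from datetime import datetime, timezone

def minutes_between(ts1, ts2):
    # .timestamp() gives seconds since the epoch -- the same quantity
    # unix_timestamp() (or a cast to long) produces in PySpark.
    # Subtract, then divide by 60 for minutes.
    return (ts1.timestamp() - ts2.timestamp()) / 60.0

later = datetime(2019, 1, 12, 17, 0, tzinfo=timezone.utc)
earlier = datetime(2019, 1, 12, 2, 18, tzinfo=timezone.utc)
print(minutes_between(later, earlier))  # 882.0 (14 h 42 min)
```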

How do you find the difference between two timestamps?

Discussion: If you'd like to calculate the difference between the timestamps in seconds, multiply the decimal difference in days by the number of seconds in a day: 24 * 60 * 60 = 86400, i.e. the product of the number of hours in a day, the number of minutes in an hour, and the number of seconds in a minute.
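The days-to-seconds conversion above is a one-line multiplication; a minimal illustration with an assumed fractional difference of 1.5 days:

```python
# Seconds in a day: hours * minutes * seconds
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

# Convert a fractional day difference into seconds
diff_days = 1.5
diff_seconds = diff_days * SECONDS_PER_DAY
print(diff_seconds)  # 129600.0
```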

How do you subtract two date columns in PySpark?

The difference between two dates in days, years, months, and quarters can be obtained in PySpark using the datediff() and months_between() functions. The datediff() function calculates the difference between two dates in days.
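As a quick sanity check of what datediff() returns, the same whole-day difference can be reproduced with Python's standard-library date type (a plain-Python analogue, not Spark itself):

```python
from datetime import date

# Whole-day difference, analogous to datediff(end, start) in PySpark
end = date(2019, 1, 12)
start = date(2019, 1, 9)
day_diff = (end - start).days
print(day_diff)  # 3
```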

How do you subtract days in PySpark?

To add or subtract days, months, and years from a timestamp in PySpark, use the date_add() and add_months() functions. add_months() takes a number of months as its argument to add months to a timestamp; date_add() takes a number of days as its argument to add days to a timestamp (a negative count, or date_sub(), subtracts them).
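The day-shifting behaviour of date_add() can be mirrored with a timedelta in plain Python (an analogue for illustration, not the Spark API itself):

```python
from datetime import datetime, timedelta

ts = datetime(2019, 1, 31, 12, 0)

# Analogous to date_add(ts, 10) and date_add(ts, -10) in PySpark
plus_10 = ts + timedelta(days=10)   # 2019-02-10 12:00
minus_10 = ts - timedelta(days=10)  # 2019-01-21 12:00
```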


1 Answer

Just convert the timestamps to unix timestamps (seconds since epoch), compute the difference, and divide by 60.

For example:

import pyspark.sql.functions as F

df.withColumn(
    "date_diff_min",
    # Casting a timestamp to long yields seconds since the epoch;
    # subtracting and dividing by 60 gives the difference in minutes.
    (F.col("date_1").cast("long") - F.col("date_2").cast("long")) / 60.
).show(truncate=False)
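The expected values for the question's two rows can be verified by hand with standard-library datetimes (a plain-Python check of the same seconds-then-divide-by-60 arithmetic, not a Spark run):

```python
from datetime import datetime, timezone

rows = {
    "A": (datetime(2019, 1, 9, 1, 25, tzinfo=timezone.utc),
          datetime(2019, 1, 10, 14, 0, tzinfo=timezone.utc)),
    "B": (datetime(2019, 1, 12, 2, 18, tzinfo=timezone.utc),
          datetime(2019, 1, 12, 17, 0, tzinfo=timezone.utc)),
}

# date_1 minus date_2, in minutes -- matching the column order in the answer
date_diff_min = {
    rid: (d1.timestamp() - d2.timestamp()) / 60.0
    for rid, (d1, d2) in rows.items()
}
print(date_diff_min)  # {'A': -2195.0, 'B': -882.0}
```

Row A spans 36 h 35 min and row B spans 14 h 42 min, so both come out negative because date_1 precedes date_2.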
pault answered Nov 11 '22