I have two timestamp columns in a dataframe that I'd like to get the minute difference of, or alternatively, the hour difference of. Currently I'm able to get the day difference, with rounding, by doing
val df2 = df1.withColumn("time", datediff(df1("ts1"), df1("ts2")))
However, when i looked at the doc page https://issues.apache.org/jira/browse/SPARK-8185 I didn't see any extra parameters to change the unit. Is their a different function I should be using for this?
The answer given by Daniel de Paula works, but that solution does not work in the case where the difference is needed for every row in your table. Here is a solution that will do that for each row:
import org.apache.spark.sql.functions
val df2 = df1.selectExpr("(unix_timestamp(ts1) - unix_timestamp(ts2))/3600")
This first converts the data in the columns to a unix timestamp in seconds, subtracts them and then converts the difference to hours.
A useful list of functions can be found at: http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.functions$
You can get the difference in seconds by
import org.apache.spark.sql.functions._
val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
Then you can do some math to get the unit you want. For example:
val df2 = df1
.withColumn( "diff_secs", diff_secs_col )
.withColumn( "diff_mins", diff_secs_col / 60D )
.withColumn( "diff_hrs", diff_secs_col / 3600D )
.withColumn( "diff_days", diff_secs_col / (24D * 3600D) )
Or, in pyspark:
from pyspark.sql.functions import *
diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
df2 = df1 \
.withColumn( "diff_secs", diff_secs_col ) \
.withColumn( "diff_mins", diff_secs_col / 60D ) \
.withColumn( "diff_hrs", diff_secs_col / 3600D ) \
.withColumn( "diff_days", diff_secs_col / (24D * 3600D) )
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With