Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark Scala: DateDiff of two columns by hour or minute

I have two timestamp columns in a dataframe that I'd like to get the minute difference of, or alternatively, the hour difference of. Currently I'm able to get the day difference, with rounding, by doing

val df2 = df1.withColumn("time", datediff(df1("ts1"), df1("ts2")))

However, when i looked at the doc page https://issues.apache.org/jira/browse/SPARK-8185 I didn't see any extra parameters to change the unit. Is their a different function I should be using for this?

like image 368
mt88 Avatar asked May 05 '16 18:05

mt88


2 Answers

The answer given by Daniel de Paula works, but that solution does not work in the case where the difference is needed for every row in your table. Here is a solution that will do that for each row:

import org.apache.spark.sql.functions

val df2 = df1.selectExpr("(unix_timestamp(ts1) - unix_timestamp(ts2))/3600")

This first converts the data in the columns to a unix timestamp in seconds, subtracts them and then converts the difference to hours.

A useful list of functions can be found at: http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.functions$

like image 43
Jeremy Avatar answered Sep 23 '22 17:09

Jeremy


You can get the difference in seconds by

import org.apache.spark.sql.functions._
val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")

Then you can do some math to get the unit you want. For example:

val df2 = df1
  .withColumn( "diff_secs", diff_secs_col )
  .withColumn( "diff_mins", diff_secs_col / 60D )
  .withColumn( "diff_hrs",  diff_secs_col / 3600D )
  .withColumn( "diff_days", diff_secs_col / (24D * 3600D) )

Or, in pyspark:

from pyspark.sql.functions import *
diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")

df2 = df1 \
  .withColumn( "diff_secs", diff_secs_col ) \
  .withColumn( "diff_mins", diff_secs_col / 60D ) \
  .withColumn( "diff_hrs",  diff_secs_col / 3600D ) \
  .withColumn( "diff_days", diff_secs_col / (24D * 3600D) )
like image 95
Daniel de Paula Avatar answered Sep 19 '22 17:09

Daniel de Paula