Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark timestamp difference

I am trying to do a timestamp difference in Spark and it is not working as expected.

Below is how I'm trying to

import org.apache.spark.sql.functions.*
df = df.withColumn("TimeStampDiff", from_unixtime(unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss").minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")),"HH:mm:ss"))

Values

TimeStampHigh - 15:57:01
TimeStampLow - 00:11:57

It returns me a result of 10:45:04 Expected output - 15:45:04

My other alternative is to go to an UDF with Java implementation.

Any pointers will help.

like image 236
Vinay Avatar asked May 19 '26 02:05

Vinay


1 Answers

That's because from_unixtime (emphasis mine):

Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.

Clearly your system or JVM is not configured to use UTC time.

You should do one of the following:

  • Configure JVM to use appropriate time zone (-Duser.timezone=UTC for both spark.executor.extraJavaOptions and spark.driver.extraJavaOptions).
  • Set spark.sql.session.timeZone to use appropriate time zone.

Example:

scala> val df = Seq(("15:57:01", "00:11:57")).toDF("TimeStampHigh", "TimeStampLow")
df: org.apache.spark.sql.DataFrame = [TimeStampHigh: string, TimeStampLow: string]

scala> spark.conf.set("spark.sql.session.timeZone", "GMT-5")  // Equivalent to your current settings

scala> df.withColumn("TimeStampDiff", from_unixtime(unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss").minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")),"HH:mm:ss")).show
+-------------+------------+-------------+
|TimeStampHigh|TimeStampLow|TimeStampDiff|
+-------------+------------+-------------+
|     15:57:01|    00:11:57|     10:45:04|
+-------------+------------+-------------+


scala> spark.conf.set("spark.sql.session.timeZone", "UTC")  // With UTC

scala> df.withColumn("TimeStampDiff", from_unixtime(unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss").minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")),"HH:mm:ss")).show
+-------------+------------+-------------+
|TimeStampHigh|TimeStampLow|TimeStampDiff|
+-------------+------------+-------------+
|     15:57:01|    00:11:57|     15:45:04|
+-------------+------------+-------------+
like image 76
Alper t. Turker Avatar answered May 21 '26 15:05

Alper t. Turker



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!