 

How to set timezone to UTC in Apache Spark?

In Spark's WebUI (port 8080), on the Environment tab, there is the following setting:

user.timezone   Zulu

Do you know how/where I can override this to UTC?

Env details:

  • Spark 2.1.1
  • jre-1.8.0-openjdk.x86_64
  • no jdk
  • EC2 Amazon Linux
asked Apr 04 '18 by tooptoop4



2 Answers

Since Spark 2.2.0 (SPARK-18936, https://issues.apache.org/jira/browse/SPARK-18936) you can use:

spark.conf.set("spark.sql.session.timeZone", "UTC")
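If you want this to be the default for every session rather than set programmatically, the same properties can go in conf/spark-defaults.conf. A sketch (property names are standard Spark configuration keys; the extraJavaOptions lines also pin the JVM's user.timezone, which is what the Environment tab shows):

```
spark.sql.session.timeZone        UTC
spark.driver.extraJavaOptions     -Duser.timezone=UTC
spark.executor.extraJavaOptions   -Duser.timezone=UTC
```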

Additionally, I set my default JVM TimeZone to UTC to avoid implicit conversions:

TimeZone.setDefault(TimeZone.getTimeZone("UTC"))

Otherwise, when the timestamp you're converting carries no timezone information, you will get an implicit conversion from your default TimeZone to UTC.

Example:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import spark.implicits._  // needed for createDataset on a Seq[String]

val rawJson = """ {"some_date_field": "2018-09-14 16:05:37"} """

val dsRaw = spark.createDataset(Seq(rawJson))

val output = dsRaw
  .select(
    from_json(
      col("value"),
      new StructType(Array(StructField("some_date_field", DataTypes.TimestampType)))
    ).as("parsed")
  )
  .select("parsed.*")

If my default TimeZone is Europe/Dublin (UTC+1 in September) and the Spark SQL session timezone is set to UTC, Spark will assume "2018-09-14 16:05:37" is in the Europe/Dublin timezone and convert it, so the result will be "2018-09-14 15:05:37".
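The same shift can be reproduced outside Spark with Python's standard zoneinfo module. This is only a sketch of the conversion being described, not Spark's internal code path:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Parse the naive string, then declare it to be Europe/Dublin wall-clock time,
# as Spark would if the default JVM timezone were Europe/Dublin.
naive = datetime.strptime("2018-09-14 16:05:37", "%Y-%m-%d %H:%M:%S")
as_dublin = naive.replace(tzinfo=ZoneInfo("Europe/Dublin"))

# Converting to UTC shifts the wall clock back one hour
# (Dublin observes UTC+1 in September).
in_utc = as_dublin.astimezone(ZoneInfo("UTC"))
print(in_utc.strftime("%Y-%m-%d %H:%M:%S"))  # 2018-09-14 15:05:37
```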

answered Oct 30 '22 by Daniel


In some cases you will also want to set the JVM timezone. For example, when loading data into a TimestampType column, it will interpret the string in the local JVM timezone. To set the JVM timezone you will need to add extra JVM options for the driver and executor:

import pyspark.sql

spark = pyspark.sql.SparkSession \
    .builder \
    .appName('test') \
    .master('local') \
    .config('spark.driver.extraJavaOptions', '-Duser.timezone=GMT') \
    .config('spark.executor.extraJavaOptions', '-Duser.timezone=GMT') \
    .config('spark.sql.session.timeZone', 'UTC') \
    .getOrCreate()

We do this in our local unit test environment, since our local time is not GMT.
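What -Duser.timezone does for the JVM, the POSIX TZ environment variable does for an ordinary process. A stdlib-only Python sketch (Unix-specific, an analogy rather than Spark itself) showing how the same instant renders differently depending on the process timezone:

```python
import calendar
import os
import time

# Epoch seconds for 2018-09-14 15:05:37 UTC (timegm interprets the tuple as UTC).
epoch = calendar.timegm((2018, 9, 14, 15, 5, 37, 0, 0, 0))

# With the process timezone set to Europe/Dublin (UTC+1 in September),
# the instant renders one hour later on the wall clock.
os.environ["TZ"] = "Europe/Dublin"
time.tzset()
dublin = time.strftime("%H:%M:%S", time.localtime(epoch))

# With TZ=UTC, local time and UTC coincide.
os.environ["TZ"] = "UTC"
time.tzset()
utc = time.strftime("%H:%M:%S", time.localtime(epoch))

print(dublin, utc)  # 16:05:37 15:05:37
```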

Useful reference: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

answered Oct 30 '22 by Moemars