 

Convert UTC timestamp to local time based on time zone in PySpark

I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time and I want to create a new column that has the local time based on the time_zone column. How can I do that in PySpark?

df
    +-------------------------+------------+
    |  hour                   | time_zone  |
    +-------------------------+------------+
    |2019-10-16T20:00:00+0000 | US/Eastern |
    |2019-10-15T23:00:00+0000 | US/Central |
    +-------------------------+------------+

What I want:
    +-------------------------+------------+---------------------+
    |  hour                   | time_zone  | local_time          |
    +-------------------------+------------+---------------------+
    |2019-10-16T20:00:00+0000 | US/Eastern | 2019-10-16T15:00:00 |
    |2019-10-15T23:00:00+0000 | US/Central | 2019-10-15T17:00:00 |
    +-------------------------+------------+---------------------+
asked Dec 02 '19 by Gaurav Bansal

People also ask

How do you convert UTC time to local time in PySpark?

In Spark SQL, the function from_utc_timestamp(timestamp, timezone) converts a UTC timestamp to a timestamp in the given time zone, while to_utc_timestamp(timestamp, timezone) converts a timestamp in the given time zone to a UTC timestamp.

How do I convert UTC Timestamp to local time in python?

First import datetime from the standard library and tzutc and tzlocal from the dateutil.tz module. Calling datetime.now(tzutc()) returns the current date and time in UTC. To convert that UTC time to local time, call its astimezone() method and pass in tzlocal().
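As a quick sanity check outside Spark, the same per-row conversion can be done with the standard library's zoneinfo module (Python 3.9+). Note that zoneinfo applies daylight saving rules, so US/Eastern is UTC-4 (EDT) on 2019-10-16:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+ standard library

def utc_to_local(ts: str, tz_name: str) -> str:
    """Parse an ISO-8601 UTC timestamp and render it in the given zone."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S%z")
    return dt.astimezone(ZoneInfo(tz_name)).strftime("%Y-%m-%dT%H:%M:%S")

print(utc_to_local("2019-10-16T20:00:00+0000", "US/Eastern"))  # 2019-10-16T16:00:00 (EDT)
```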

How do I convert a column to Timestamp in PySpark?

PySpark to_timestamp() – Convert String to Timestamp type. Use the to_timestamp() function to convert a String to a Timestamp (TimestampType) in PySpark. By default it expects the input in the format yyyy-MM-dd HH:mm:ss.


1 Answer

You can use the built-in from_utc_timestamp function. Note that the hour column needs to be passed to the function as a string without a timezone offset.

The code below works for Spark versions 2.4 and later, where the time zone argument can itself be a column.

from pyspark.sql.functions import from_utc_timestamp, split

# Strip the "+0000" offset, then shift each row's timestamp from UTC to its own zone.
df.select(from_utc_timestamp(split(df.hour, r'\+')[0], df.time_zone).alias('local_time')).show()

For Spark versions before 2.4, the second argument to the function must be a constant string naming the time zone; it cannot vary per row.

Documentation

pyspark.sql.functions.from_utc_timestamp(timestamp, tz)

This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp which is timezone-agnostic, and interprets it as a timestamp in UTC, and renders that timestamp as a timestamp in the given time zone.

However, a timestamp in Spark represents the number of microseconds from the Unix epoch, which is not timezone-agnostic. So in Spark this function just shifts the timestamp value from the UTC time zone to the given time zone.

This function may return a confusing result if the input is a string with a timezone, e.g. '2018-03-13T06:18:23+00:00'. The reason is that Spark first casts the string to a timestamp according to the timezone in the string, and finally displays the result by converting the timestamp to a string according to the session-local timezone.

Parameters:

timestamp – the column that contains timestamps

tz – a string that has the ID of the time zone, e.g. "GMT", "America/Los_Angeles", etc.

Changed in version 2.4: tz can take a Column containing timezone ID strings.

answered Nov 15 '22 by Vamsi Prabhala