I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time and I want to create a new column that has the local time based on the time_zone column. How can I do that in PySpark?
df
+-------------------------+------------+
| hour | time_zone |
+-------------------------+------------+
|2019-10-16T20:00:00+0000 | US/Eastern |
|2019-10-15T23:00:00+0000 | US/Central |
+-------------------------+------------+
#What I want:
+-------------------------+------------+---------------------+
| hour | time_zone | local_time |
+-------------------------+------------+---------------------+
|2019-10-16T20:00:00+0000 | US/Eastern | 2019-10-16T16:00:00 |
|2019-10-15T23:00:00+0000 | US/Central | 2019-10-15T18:00:00 |
+-------------------------+------------+---------------------+
In Spark SQL, the function from_utc_timestamp(timestamp, timezone) converts a UTC timestamp to a timestamp in the given time zone; to_utc_timestamp(timestamp, timezone) converts a timestamp in the given time zone to a UTC timestamp.
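For example, a minimal sketch (assuming an active spark session) that applies both functions to a literal timestamp taken from the sample data:

# Both conversions via Spark SQL: from_utc_timestamp shifts UTC -> zone, to_utc_timestamp shifts zone -> UTC.
spark.sql("""
    SELECT from_utc_timestamp(timestamp '2019-10-16 20:00:00', 'US/Eastern') AS local_ts,
           to_utc_timestamp(timestamp '2019-10-16 16:00:00', 'US/Eastern') AS utc_ts
""").show(truncate=False)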
Alternatively, in plain Python you can import datetime along with tzutc and tzlocal from dateutil.tz. Calling datetime.now(tzutc()) gives the current UTC date and time, and calling .astimezone(tzlocal()) on that UTC value converts it to local time.
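A small sketch of that plain-Python approach (outside Spark), assuming the dateutil package is installed:

from datetime import datetime
from dateutil.tz import tzutc, tzlocal

utc_now = datetime.now(tzutc())            # timezone-aware current UTC time
local_now = utc_now.astimezone(tzlocal())  # same instant in the local time zone
print(utc_now.isoformat(), local_now.isoformat())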
PySpark to_timestamp() – convert String to Timestamp type: use the to_timestamp() function to convert a String to a Timestamp (TimestampType) in PySpark. The converted value uses the default format yyyy-MM-dd HH:mm:ss.
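As a sketch, the hour strings from the question could be parsed into a proper TimestampType column like this (the format pattern is an assumption matching the sample data):

from pyspark.sql.functions import to_timestamp

# Parse the ISO-like strings, including the +0000 offset, into TimestampType.
df.withColumn('hour_ts', to_timestamp('hour', "yyyy-MM-dd'T'HH:mm:ssZ")).printSchema()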
You can use the built-in from_utc_timestamp function. Note that the hour column needs to be passed to the function as a string without a timezone offset. The code below works for Spark versions 2.4 and later.
from pyspark.sql.functions import from_utc_timestamp, split

# Strip the "+0000" offset so the string is timezone-agnostic, then shift it from UTC to each row's time zone.
df.select(from_utc_timestamp(split(df.hour, r'\+')[0], df.time_zone).alias('local_time')).show()
For Spark versions before 2.4, the second argument (the time zone) must be a constant string; it cannot be a column.
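A hedged workaround sketch for those older versions is to enumerate the time zones that actually occur in the data, e.g. with when/otherwise branches that each pass a literal string:

from pyspark.sql.functions import from_utc_timestamp, split, when, col

# Each branch passes a constant time-zone string, as pre-2.4 versions require.
stripped = split(df.hour, r'\+')[0]
df.select(
    when(col('time_zone') == 'US/Eastern', from_utc_timestamp(stripped, 'US/Eastern'))
    .when(col('time_zone') == 'US/Central', from_utc_timestamp(stripped, 'US/Central'))
    .alias('local_time')
).show()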
Documentation
pyspark.sql.functions.from_utc_timestamp(timestamp, tz)
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp which is timezone-agnostic, and interprets it as a timestamp in UTC, and renders that timestamp as a timestamp in the given time zone.
However, a timestamp in Spark represents the number of microseconds from the Unix epoch, which is not timezone-agnostic. So in Spark this function just shifts the timestamp value from the UTC time zone to the given time zone.
This function may return a confusing result if the input is a string with a timezone, e.g. '2018-03-13T06:18:23+00:00'. The reason is that Spark first casts the string to a timestamp according to the timezone in the string, and finally displays the result by converting the timestamp to a string according to the session local timezone.
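A quick sketch of that caveat (assuming an active spark session); the displayed value depends on spark.sql.session.timeZone:

from pyspark.sql.functions import from_utc_timestamp, lit

# The literal already carries an offset, so Spark casts it using that offset before shifting, and the displayed result follows the session time zone.
spark.range(1).select(
    from_utc_timestamp(lit('2018-03-13T06:18:23+00:00'), 'America/Los_Angeles').alias('ts')
).show(truncate=False)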
Parameters:
timestamp – the column that contains timestamps
tz – a string that has the ID of the time zone, e.g. "GMT", "America/Los_Angeles", etc.
Changed in version 2.4: tz can take a Column containing timezone ID strings.