I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time and I want to create a new column that has the local time based on the time_zone column. How can I do that in PySpark?
df
+-------------------------+------------+
| hour | time_zone |
+-------------------------+------------+
|2019-10-16T20:00:00+0000 | US/Eastern |
|2019-10-15T23:00:00+0000 | US/Central |
+-------------------------+------------+
#What I want:
+-------------------------+------------+---------------------+
| hour | time_zone | local_time |
+-------------------------+------------+---------------------+
|2019-10-16T20:00:00+0000 | US/Eastern | 2019-10-16T16:00:00 |
|2019-10-15T23:00:00+0000 | US/Central | 2019-10-15T18:00:00 |
+-------------------------+------------+---------------------+
In Spark SQL, the function from_utc_timestamp(timestamp, timezone) converts a UTC timestamp to a timestamp in the given time zone; to_utc_timestamp(timestamp, timezone) converts a timestamp in the given time zone to a UTC timestamp.
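For example, a minimal sketch (assuming an active spark session) that applies both functions to a literal timestamp taken from the sample data:

# Both conversions via Spark SQL: from_utc_timestamp shifts UTC -> zone, to_utc_timestamp shifts zone -> UTC.
spark.sql("""
    SELECT from_utc_timestamp(timestamp '2019-10-16 20:00:00', 'US/Eastern') AS local_ts,
           to_utc_timestamp(timestamp '2019-10-16 16:00:00', 'US/Eastern') AS utc_ts
""").show(truncate=False)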
Alternatively, in plain Python you can import datetime along with tzutc and tzlocal from dateutil.tz. Calling datetime.now(tzutc()) gives the current UTC date and time, and calling .astimezone(tzlocal()) on that UTC value converts it to local time.
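A small sketch of that plain-Python approach (outside Spark), assuming the dateutil package is installed:

from datetime import datetime
from dateutil.tz import tzutc, tzlocal

utc_now = datetime.now(tzutc())            # timezone-aware current UTC time
local_now = utc_now.astimezone(tzlocal())  # same instant in the local time zone
print(utc_now.isoformat(), local_now.isoformat())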
PySpark to_timestamp() – convert String to Timestamp type: use the to_timestamp() function to convert a String to a Timestamp (TimestampType) in PySpark. The converted value uses the default format yyyy-MM-dd HH:mm:ss.
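As a sketch, the hour strings from the question could be parsed into a proper TimestampType column like this (the format pattern is an assumption matching the sample data):

from pyspark.sql.functions import to_timestamp

# Parse the ISO-like strings, including the +0000 offset, into TimestampType.
df.withColumn('hour_ts', to_timestamp('hour', "yyyy-MM-dd'T'HH:mm:ssZ")).printSchema()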
You can use the built-in from_utc_timestamp function. Note that the hour column needs to be passed to the function as a string without a timezone offset. The code below works for Spark versions 2.4 and later.
from pyspark.sql.functions import from_utc_timestamp, split

# Strip the "+0000" offset so the string is timezone-agnostic, then shift it from UTC to each row's time zone.
df.select(from_utc_timestamp(split(df.hour, r'\+')[0], df.time_zone).alias('local_time')).show()
For Spark versions before 2.4, the second argument (the time zone) must be a constant string; it cannot be a column.
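A hedged workaround sketch for those older versions is to enumerate the time zones that actually occur in the data, e.g. with when/otherwise branches that each pass a literal string:

from pyspark.sql.functions import from_utc_timestamp, split, when, col

# Each branch passes a constant time-zone string, as pre-2.4 versions require.
stripped = split(df.hour, r'\+')[0]
df.select(
    when(col('time_zone') == 'US/Eastern', from_utc_timestamp(stripped, 'US/Eastern'))
    .when(col('time_zone') == 'US/Central', from_utc_timestamp(stripped, 'US/Central'))
    .alias('local_time')
).show()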
Documentation
pyspark.sql.functions.from_utc_timestamp(timestamp, tz)
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp which is timezone-agnostic, and interprets it as a timestamp in UTC, and renders that timestamp as a timestamp in the given time zone.
However, a timestamp in Spark represents the number of microseconds from the Unix epoch, which is not timezone-agnostic. So in Spark this function just shifts the timestamp value from the UTC time zone to the given time zone.
This function may return a confusing result if the input is a string with a timezone, e.g. '2018-03-13T06:18:23+00:00'. The reason is that Spark first casts the string to a timestamp according to the timezone in the string, and finally displays the result by converting the timestamp to a string according to the session local timezone.
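A quick sketch of that caveat (assuming an active spark session); the displayed value depends on spark.sql.session.timeZone:

from pyspark.sql.functions import from_utc_timestamp, lit

# The literal already carries an offset, so Spark casts it using that offset before shifting, and the displayed result follows the session time zone.
spark.range(1).select(
    from_utc_timestamp(lit('2018-03-13T06:18:23+00:00'), 'America/Los_Angeles').alias('ts')
).show(truncate=False)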
Parameters:
timestamp – the column that contains timestamps
tz – a string that has the ID of the time zone, e.g. "GMT", "America/Los_Angeles", etc.
Changed in version 2.4: tz can take a Column containing timezone ID strings.