
Change the timestamp to UTC format in Pyspark

I have an input dataframe (ip_df) whose data looks like this:

id            timestamp_value
1       2017-08-01T14:30:00+05:30
2       2017-08-01T14:30:00+06:30
3       2017-08-01T14:30:00+07:30

I need to create a new dataframe (op_df) in which the timestamp values are converted to UTC. The final output dataframe will look like this:

id            timestamp_value
1       2017-08-01T09:00:00+00:00
2       2017-08-01T08:00:00+00:00
3       2017-08-01T07:00:00+00:00

I want to achieve this using PySpark. Can someone please help me with it? Any help will be appreciated.

Asked Aug 01 '17 by Arunanshu P

2 Answers

If you absolutely need the timestamp to be formatted exactly as indicated, namely, with the timezone represented as "+00:00", I think using a UDF as already suggested is your best option.

However, if you can tolerate a slightly different representation of the timezone, e.g. either "+0000" (no colon separator) or "Z", it's possible to do this without a UDF, which may perform significantly better for you depending on the size of your dataset.

Given the following representation of data

+---+-------------------------+
|id |timestamp_value          |
+---+-------------------------+
|1  |2017-08-01T14:30:00+05:30|
|2  |2017-08-01T14:30:00+06:30|
|3  |2017-08-01T14:30:00+07:30|
+---+-------------------------+

as given by:

l = [(1, '2017-08-01T14:30:00+05:30'), (2, '2017-08-01T14:30:00+06:30'), (3, '2017-08-01T14:30:00+07:30')]
ip_df = spark.createDataFrame(l, ['id', 'timestamp_value'])

where timestamp_value is a String, you could do the following (this uses to_timestamp and session-local timezone support, which were introduced in Spark 2.2):

from pyspark.sql.functions import to_timestamp, date_format
spark.conf.set('spark.sql.session.timeZone', 'UTC')
op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"), 
        "yyyy-MM-dd'T'HH:mm:ssZ"
    ).alias('timestamp_value'))

which yields:

+------------------------+
|timestamp_value         |
+------------------------+
|2017-08-01T09:00:00+0000|
|2017-08-01T08:00:00+0000|
|2017-08-01T07:00:00+0000|
+------------------------+

or, slightly differently:

op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"), 
        "yyyy-MM-dd'T'HH:mm:ssXXX"
    ).alias('timestamp_value'))

which yields:

+--------------------+
|timestamp_value     |
+--------------------+
|2017-08-01T09:00:00Z|
|2017-08-01T08:00:00Z|
|2017-08-01T07:00:00Z|
+--------------------+
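
If you really do need the "+00:00" colon form but want to avoid a Python UDF, one option (a sketch, not tested against your cluster) is to patch the "+0000" output from the first variant with Spark's regexp_replace, inserting a colon into the trailing offset. The regex itself is easy to verify in plain Python; the function name here is just for illustration:

```python
import re

def add_offset_colon(s):
    # Insert a colon into a trailing numeric UTC offset: "+0000" -> "+00:00"
    return re.sub(r'([+-]\d{2})(\d{2})$', r'\1:\2', s)

print(add_offset_colon('2017-08-01T09:00:00+0000'))
# 2017-08-01T09:00:00+00:00
```

In Spark the equivalent would be something like regexp_replace(col, r'([+-]\d{2})(\d{2})$', '$1:$2') (Spark uses Java regex syntax, so the replacement groups are $1/$2 rather than \1/\2).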

Answered Oct 21 '22 by eddies


You can use parser and tz from the dateutil library.
I assume you have strings and want a string column:

from dateutil import parser, tz
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, udf

# Create the UTC timezone
utc_zone = tz.gettz('UTC')

# Create a UDF to apply to the column: it parses each string to a
# timestamp, converts it to UTC, then serializes it back to a string
func = udf(lambda x: parser.parse(x).astimezone(utc_zone).isoformat(), StringType())

# Create the new column in your dataset
df = df.withColumn("new_timestamp", func(col("timestamp_value")))

It gives this result:

+---+-------------------------+-------------------------+
|id |timestamp_value          |new_timestamp            |
+---+-------------------------+-------------------------+
|1  |2017-08-01T14:30:00+05:30|2017-08-01T09:00:00+00:00|
|2  |2017-08-01T14:30:00+06:30|2017-08-01T08:00:00+00:00|
|3  |2017-08-01T14:30:00+07:30|2017-08-01T07:00:00+00:00|
+---+-------------------------+-------------------------+
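
As a local sanity check outside Spark, the same conversion the UDF performs can be reproduced with the standard library alone on Python 3.7+, where datetime.fromisoformat understands "+05:30"-style offsets (the helper name is illustrative):

```python
from datetime import datetime, timezone

def to_utc_iso(s):
    # Parse an ISO 8601 string with an offset, shift it to UTC,
    # then serialize it back to an ISO 8601 string
    return datetime.fromisoformat(s).astimezone(timezone.utc).isoformat()

print(to_utc_iso('2017-08-01T14:30:00+05:30'))
# 2017-08-01T09:00:00+00:00
```

dateutil's parser is more forgiving than fromisoformat (e.g. fromisoformat only accepts a trailing "Z" from Python 3.11 on), so the dateutil version remains the safer choice inside the UDF if your input format varies.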

Finally, you can drop the original column and rename the new one:

df = df.drop("timestamp_value").withColumnRenamed("new_timestamp","timestamp_value")

Answered Oct 21 '22 by Fabich