I have an input dataframe (ip_df) whose data looks like this:
id timestamp_value
1 2017-08-01T14:30:00+05:30
2 2017-08-01T14:30:00+06:30
3 2017-08-01T14:30:00+07:30
I need to create a new dataframe (op_df) in which the timestamp values are converted to UTC. The final output dataframe should look like this:
id timestamp_value
1 2017-08-01T09:00:00+00:00
2 2017-08-01T08:00:00+00:00
3 2017-08-01T07:00:00+00:00
I want to achieve this using PySpark. Can someone please help me with it? Any help will be appreciated.
If you absolutely need the timestamp to be formatted exactly as indicated, namely, with the timezone represented as "+00:00", I think using a UDF as already suggested is your best option.
However, if you can tolerate a slightly different representation of the timezone, e.g. either "+0000" (no colon separator) or "Z", it's possible to do this without a UDF, which may perform significantly better for you depending on the size of your dataset.
Given the following representation of data
+---+-------------------------+
|id |timestamp_value          |
+---+-------------------------+
|1  |2017-08-01T14:30:00+05:30|
|2  |2017-08-01T14:30:00+06:30|
|3  |2017-08-01T14:30:00+07:30|
+---+-------------------------+
as given by:
l = [(1, '2017-08-01T14:30:00+05:30'), (2, '2017-08-01T14:30:00+06:30'), (3, '2017-08-01T14:30:00+07:30')]
ip_df = spark.createDataFrame(l, ['id', 'timestamp_value'])
where timestamp_value is a String, you could do the following (this uses to_timestamp and session local timezone support, both of which were introduced in Spark 2.2):
from pyspark.sql.functions import to_timestamp, date_format
spark.conf.set('spark.sql.session.timeZone', 'UTC')
op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"),
        "yyyy-MM-dd'T'HH:mm:ssZ"
    ).alias('timestamp_value'))
which yields:
+------------------------+
|timestamp_value         |
+------------------------+
|2017-08-01T09:00:00+0000|
|2017-08-01T08:00:00+0000|
|2017-08-01T07:00:00+0000|
+------------------------+
or, slightly differently:
op_df = ip_df.select(
    date_format(
        to_timestamp(ip_df.timestamp_value, "yyyy-MM-dd'T'HH:mm:ssXXX"),
        "yyyy-MM-dd'T'HH:mm:ssXXX"
    ).alias('timestamp_value'))
which yields:
+--------------------+
|timestamp_value     |
+--------------------+
|2017-08-01T09:00:00Z|
|2017-08-01T08:00:00Z|
|2017-08-01T07:00:00Z|
+--------------------+
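As a quick sanity check on the claim that "+00:00", "+0000" and "Z" are interchangeable spellings of the same UTC offset, here is a minimal plain-Python sketch (no Spark involved). The helper name parse_instant is made up for this illustration, and it assumes Python 3.7 or later, where strptime's %z accepts all three forms:

```python
from datetime import datetime

# %z accepts "+0000", "+00:00" and "Z" when parsing (Python 3.7+).
FMT = "%Y-%m-%dT%H:%M:%S%z"


def parse_instant(text):
    """Parse an ISO 8601 style string with an offset into an aware datetime."""
    return datetime.strptime(text, FMT)


# The first row of the question's input, in its original +05:30 offset.
original = parse_instant("2017-08-01T14:30:00+05:30")

# Three spellings of the expected UTC result.
variants = [
    "2017-08-01T09:00:00+00:00",
    "2017-08-01T09:00:00+0000",
    "2017-08-01T09:00:00Z",
]

# Aware datetimes compare by instant, not by their textual offset.
all_equal = all(parse_instant(v) == original for v in variants)
print(all_equal)
```

So whichever of the two patterns above you pick, downstream consumers that parse the offset properly should see the same instants; only the string form differs.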
You can use parser and tz from the dateutil library.
I assume you have strings and want a string column:
from dateutil import parser, tz
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, udf
# Create the UTC timezone
utc_zone = tz.gettz('UTC')
# Create a UDF to apply to the column.
# It takes the string, parses it to a datetime, converts it to UTC, then converts it back to a string.
func = udf(lambda x: parser.parse(x).astimezone(utc_zone).isoformat(), StringType())
# Add the new column, starting from the question's input dataframe (ip_df)
df = ip_df.withColumn("new_timestamp", func(col("timestamp_value")))
It gives this result:
+---+-------------------------+-------------------------+
|id |timestamp_value          |new_timestamp            |
+---+-------------------------+-------------------------+
|1  |2017-08-01T14:30:00+05:30|2017-08-01T09:00:00+00:00|
|2  |2017-08-01T14:30:00+06:30|2017-08-01T08:00:00+00:00|
|3  |2017-08-01T14:30:00+07:30|2017-08-01T07:00:00+00:00|
+---+-------------------------+-------------------------+
Finally, you can drop the original column and rename the new one:
df = df.drop("timestamp_value").withColumnRenamed("new_timestamp","timestamp_value")
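If you would rather not depend on dateutil inside the UDF, the same parse, convert, format steps can be sketched with the standard library alone. This is only an illustration under a few assumptions: the function name to_utc_iso is made up here, the inputs are assumed to always look like the ones in the question (ISO 8601 with a "+HH:MM" offset, which datetime.fromisoformat handles on Python 3.7+), and nulls are passed through unchanged, which the lambda above does not do.

```python
from datetime import datetime, timezone


def to_utc_iso(value):
    """Return the same instant as `value`, rendered in UTC as an ISO 8601 string.

    `value` is assumed to be a string such as '2017-08-01T14:30:00+05:30'.
    None is passed through so that null cells do not raise.
    """
    if value is None:
        return None
    # Parse the offset-aware string, shift it to UTC, and format it again.
    return datetime.fromisoformat(value).astimezone(timezone.utc).isoformat()


print(to_utc_iso("2017-08-01T14:30:00+05:30"))
```

It should be possible to wrap this the same way as the lambda, i.e. udf(to_utc_iso, StringType()), though I have only exercised the plain function here, not the Spark wrapper.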