I have an issue in PySpark that I cannot fully understand. I create the following datetime object
from datetime import datetime
from dateutil import tz
utc_now = datetime.now().replace(tzinfo=tz.tzutc())
utc_now  # datetime.datetime(2018, 2, 12, 13, 9, 52, 785007, tzinfo=tzutc())
and I create a Spark DataFrame
from pyspark.sql import Row
data_df = spark.createDataFrame([Row(date=utc_now)])
when I show the DataFrame
data_df.show(10, False)
the timestamp column is displayed in local time, which is two hours ahead:
>>> data_df.show(10, False)
+--------------------------+
|date |
+--------------------------+
|2018-02-12 15:09:52.785007|
+--------------------------+
and collecting the data yields a datetime object shifted two hours ahead as well:
>>> data_df.collect()
[Row(date=datetime.datetime(2018, 2, 12, 15, 9, 52, 785007))]
Zone info is also removed. Can this behavior be altered when casting to TimestampType
?
TimestampType in PySpark is not tz-aware like it is in Pandas; instead, Spark stores timestamps as long integers (microseconds since the Unix epoch) and displays them according to your machine's local time zone (by default).
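To see that the stored value itself is independent of any time zone, you can cast the column to long, which yields seconds since the Unix epoch (a minimal sketch, assuming the data_df created in the question):
from pyspark.sql import functions as F

# the epoch value is the same no matter what spark.sql.session.timeZone is set to
data_df.select(F.col('date').cast('long').alias('epoch_seconds')).show()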
That being said, you can change the Spark session time zone using the 'spark.sql.session.timeZone' configuration:
from datetime import datetime
from dateutil import tz
from pyspark.sql import Row

utc_now = datetime.now().replace(tzinfo=tz.tzutc())
print(utc_now)

# render timestamps in the Paris time zone (UTC+1 in winter)
spark.conf.set('spark.sql.session.timeZone', 'Europe/Paris')
data_df = spark.createDataFrame([Row(date=utc_now)])
data_df.show(10, False)
print(data_df.collect())
2018-02-12 20:41:16.270386+00:00
+--------------------------+
|date |
+--------------------------+
|2018-02-12 21:41:16.270386|
+--------------------------+
[Row(date=datetime.datetime(2018, 2, 12, 21, 41, 16, 270386))]
# switching the session time zone changes what show() renders
spark.conf.set('spark.sql.session.timeZone', 'UTC')
data_df2 = spark.createDataFrame([Row(date=utc_now)])
data_df2.show(10, False)
print(data_df2.collect())
+--------------------------+
|date |
+--------------------------+
|2018-02-12 20:41:16.270386|
+--------------------------+
[Row(date=datetime.datetime(2018, 2, 12, 21, 41, 16, 270386))]
As you can see, show() now renders the instant in UTC, but collect() still returns it shifted to local time, because the Python process itself is still on the 'Europe/Paris' time zone:
import os, time

# put the Python process itself on UTC as well
os.environ['TZ'] = 'UTC'
time.tzset()

utc_now = datetime.now()
spark.conf.set('spark.sql.session.timeZone', 'UTC')
data_df2 = spark.createDataFrame([Row(date=utc_now)])
data_df2.show(10, False)
print(data_df2.collect())
+--------------------------+
|date |
+--------------------------+
|2018-02-12 20:41:16.807757|
+--------------------------+
[Row(date=datetime.datetime(2018, 2, 12, 20, 41, 16, 807757))]
Moreover, pyspark.sql.functions provides two functions for shifting a timestamp between UTC and another time zone (from_utc_timestamp, to_utc_timestamp). That said, I don't think you actually want to alter your datetimes here.
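For completeness, here is a hedged sketch of those two helpers, assuming the data_df from above. Note that they shift the wall-clock value stored in the column, not merely how it is displayed:
from pyspark.sql import functions as F

# from_utc_timestamp interprets the value as UTC and returns the wall-clock
# time of that instant in the given zone; to_utc_timestamp does the reverse
data_df.select(
    F.from_utc_timestamp('date', 'Europe/Paris').alias('utc_to_paris'),
    F.to_utc_timestamp('date', 'Europe/Paris').alias('paris_to_utc'),
).show(truncate=False)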