I have an issue in PySpark that I cannot fully understand. I create the following datetime object
from datetime import datetime
from dateutil import tz
utc_now = datetime.now().replace(tzinfo=tz.tzutc())
utc_now  # datetime.datetime(2018, 2, 12, 13, 9, 52, 785007, tzinfo=tzutc())
and I create a Spark DataFrame
from pyspark.sql import Row
data_df = spark.createDataFrame([Row(date=utc_now)])
when I show the DataFrame
data_df.show(10, False)
the timestamp column is displayed in local time, which is two hours ahead:
>>> data_df.show(10, False)
+--------------------------+
|date |
+--------------------------+
|2018-02-12 15:09:52.785007|
+--------------------------+
and collecting the data yields a datetime object shifted two hours ahead as well:
>>> data_df.collect()
[Row(date=datetime.datetime(2018, 2, 12, 15, 9, 52, 785007))]
Zone info is also removed. Can this behavior be altered when casting to TimestampType
?
TimestampType in PySpark is not tz-aware like it is in Pandas; instead, Spark stores timestamps as long integers (microseconds since the Unix epoch) and displays them according to your machine's local time zone (by default).
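To see that the stored value itself is independent of any time zone, you can cast the column to long, which yields seconds since the Unix epoch (a minimal sketch, assuming the data_df created in the question):
from pyspark.sql import functions as F

# the epoch value is the same no matter what spark.sql.session.timeZone is set to
data_df.select(F.col('date').cast('long').alias('epoch_seconds')).show()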
That being said, you can change the Spark session time zone using the 'spark.sql.session.timeZone' configuration:
from datetime import datetime
from dateutil import tz
from pyspark.sql import Row

utc_now = datetime.now().replace(tzinfo=tz.tzutc())
print(utc_now)

# render timestamps in the Paris time zone (UTC+1 in winter)
spark.conf.set('spark.sql.session.timeZone', 'Europe/Paris')
data_df = spark.createDataFrame([Row(date=utc_now)])
data_df.show(10, False)
print(data_df.collect())
2018-02-12 20:41:16.270386+00:00
+--------------------------+
|date |
+--------------------------+
|2018-02-12 21:41:16.270386|
+--------------------------+
[Row(date=datetime.datetime(2018, 2, 12, 21, 41, 16, 270386))]
# switching the session time zone changes what show() renders
spark.conf.set('spark.sql.session.timeZone', 'UTC')
data_df2 = spark.createDataFrame([Row(date=utc_now)])
data_df2.show(10, False)
print(data_df2.collect())
+--------------------------+
|date |
+--------------------------+
|2018-02-12 20:41:16.270386|
+--------------------------+
[Row(date=datetime.datetime(2018, 2, 12, 21, 41, 16, 270386))]
As you can see, show() now renders the instant in UTC, but collect() still returns it shifted to local time, because the Python process itself is still on the 'Europe/Paris' time zone:
import os, time

# put the Python process itself on UTC as well
os.environ['TZ'] = 'UTC'
time.tzset()

utc_now = datetime.now()
spark.conf.set('spark.sql.session.timeZone', 'UTC')
data_df2 = spark.createDataFrame([Row(date=utc_now)])
data_df2.show(10, False)
print(data_df2.collect())
+--------------------------+
|date |
+--------------------------+
|2018-02-12 20:41:16.807757|
+--------------------------+
[Row(date=datetime.datetime(2018, 2, 12, 20, 41, 16, 807757))]
Moreover, pyspark.sql.functions provides two functions for shifting a timestamp between UTC and another time zone (from_utc_timestamp, to_utc_timestamp). That said, I don't think you actually want to alter your datetimes here.
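For completeness, here is a hedged sketch of those two helpers, assuming the data_df from above. Note that they shift the wall-clock value stored in the column, not merely how it is displayed:
from pyspark.sql import functions as F

# from_utc_timestamp interprets the value as UTC and returns the wall-clock
# time of that instant in the given zone; to_utc_timestamp does the reverse
data_df.select(
    F.from_utc_timestamp('date', 'Europe/Paris').alias('utc_to_paris'),
    F.to_utc_timestamp('date', 'Europe/Paris').alias('paris_to_utc'),
).show(truncate=False)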