TimeStampType in Pyspark with datetime tzaware objects

I have run into behavior in PySpark that I cannot fully understand. I create the following timezone-aware datetime object:

from datetime import datetime
from dateutil import tz

utc_now = datetime.now().replace(tzinfo=tz.tzutc())
utc_now  # datetime.datetime(2018, 2, 12, 13, 9, 52, 785007, tzinfo=tzutc())

and create a Spark DataFrame from it:

data_df = spark.createDataFrame([Row(date=utc_now)])

When I show the DataFrame,

data_df.show(10, False)

the column is displayed in my local time, which is two hours ahead:

>>> data_df.show(10, False)
+--------------------------+
|date                      |
+--------------------------+
|2018-02-12 15:09:52.785007|
+--------------------------+

Collecting the data also yields a datetime object shifted two hours ahead:

>>> data_df.collect()
[Row(date=datetime.datetime(2018, 2, 12, 15, 9, 52, 785007))]

The timezone info is also stripped. Can this behavior be changed when casting to TimestampType?

asked Feb 12 '18 by Apostolos

1 Answer

TimestampType in PySpark is not timezone-aware the way it is in pandas; internally Spark stores timestamps as long integers (an epoch-based count) and, by default, renders them according to your machine's local time zone.
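The underlying idea can be illustrated in plain Python (this is an analogy for how Spark treats timestamps, not PySpark code): the same instant, stored as a zone-independent epoch value, prints differently depending on the zone used to render it.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# One instant in time, stored as a zone-independent count of epoch microseconds
epoch_us = 1518440992785007  # microseconds since 1970-01-01 00:00 UTC

instant = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=epoch_us)
print(instant)  # 2018-02-12 13:09:52.785007+00:00

# The same instant rendered for another zone: the epoch value is unchanged,
# only the wall-clock representation differs.
print(instant.astimezone(ZoneInfo("Europe/Paris")))  # 2018-02-12 14:09:52.785007+01:00
```

Spark's `show()` and `collect()` do the equivalent of that last rendering step using the session (or machine) time zone.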

That said, you can change the Spark session time zone via the 'spark.sql.session.timeZone' configuration:

from datetime import datetime
from dateutil import tz
from pyspark.sql import Row

utc_now = datetime.now().replace(tzinfo=tz.tzutc())
print(utc_now)

spark.conf.set('spark.sql.session.timeZone', 'Europe/Paris')
data_df = spark.createDataFrame([Row(date=utc_now)])
data_df.show(10, False)
print(data_df.collect())

    2018-02-12 20:41:16.270386+00:00
    +--------------------------+
    |date                      |
    +--------------------------+
    |2018-02-12 21:41:16.270386|
    +--------------------------+

    [Row(date=datetime.datetime(2018, 2, 12, 21, 41, 16, 270386))]


spark.conf.set('spark.sql.session.timeZone', 'UTC')
data_df2 = spark.createDataFrame([Row(date=utc_now)])
data_df2.show(10, False)
print(data_df2.collect())

    +--------------------------+
    |date                      |
    +--------------------------+
    |2018-02-12 20:41:16.270386|
    +--------------------------+

    [Row(date=datetime.datetime(2018, 2, 12, 21, 41, 16, 270386))]

As you can see, Spark now displays the timestamp as UTC in show(), but collect() still serves it back in the local time zone, since the Python process itself is still set to 'Europe/Paris'. Changing the time zone at the Python level as well fixes this:

import os, time
os.environ['TZ'] = 'UTC'
time.tzset()
utc_now = datetime.now()
spark.conf.set('spark.sql.session.timeZone', 'UTC')
data_df2 = spark.createDataFrame([Row(date=utc_now)])
data_df2.show(10, False)
print(data_df2.collect())

    +--------------------------+
    |date                      |
    +--------------------------+
    |2018-02-12 20:41:16.807757|
    +--------------------------+

    [Row(date=datetime.datetime(2018, 2, 12, 20, 41, 16, 807757))]

Moreover, the pyspark.sql.functions module provides two functions for shifting a timestamp between time zones (from_utc_timestamp, to_utc_timestamp). Although in your case, I don't think you actually want to alter your datetimes.
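For reference, these functions reinterpret a timestamp's wall-clock time rather than just relabeling it. A rough pure-Python analogue of to_utc_timestamp (my sketch of the per-row semantics, not the actual Spark implementation):

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def to_utc_timestamp(ts: datetime, tz: str) -> datetime:
    """Treat the naive wall-clock time `ts` as local to `tz` and return
    the corresponding naive UTC wall-clock time -- roughly what
    pyspark.sql.functions.to_utc_timestamp does for each row."""
    localized = ts.replace(tzinfo=ZoneInfo(tz))
    return localized.astimezone(ZoneInfo("UTC")).replace(tzinfo=None)


print(to_utc_timestamp(datetime(2018, 2, 12, 21, 41, 16), "Europe/Paris"))
# 2018-02-12 20:41:16
```

from_utc_timestamp is simply the inverse: it treats the input as UTC wall-clock time and shifts it into the given zone.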

answered Oct 06 '22 by MaFF