Spark reads a table from MySQL that has a timestamp column storing values in the UTC timezone. Spark is configured with the local timezone (IST). MySQL stores the timestamp values below.

spark.conf.set("spark.sql.session.timeZone" , "UTC")
df.show(100,False)

After setting the above conf, I can see the correct records with df.show(). However, df.rdd.collect() converts these values back to the IST timezone.
for row in df.rdd.collect():
    print("row.Mindate ", row.Mindate)
row.Mindate 2021-03-02 19:30:31
row.Mindate 2021-04-01 14:05:03
row.Mindate 2021-06-15 11:39:40
row.Mindate 2021-07-07 18:14:17
row.Mindate 2021-08-03 10:48:51
row.Mindate 2021-10-06 10:21:11
The Spark DataFrame and df.rdd show different result sets.
Why are the values converted back to the local timezone even after setting "spark.sql.session.timeZone" to "UTC"?
Thanks in advance
EDIT 1:
df.printSchema()
root
|-- Mindate: timestamp (nullable = true)
|-- Maxdate: timestamp (nullable = true)
Make sure that your Spark timezone (spark.sql.session.timeZone) is set to the same timezone as your Python timezone (TZ environment variable). Spark will convert between the two whenever you call DataFrame.collect(). You can do this as follows:
import os
import time
# change Python timezone
os.environ["TZ"] = "UTC"
time.tzset()
# change Spark timezone
spark.conf.set("spark.sql.session.timeZone", "UTC")
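With both set to UTC, collect() should return the same wall-clock values that df.show() displays. A quick sanity check (a sketch reusing the question's df and its Mindate column, plus the time module imported above):
# Both sides now agree on UTC
print(spark.conf.get("spark.sql.session.timeZone"))  # 'UTC'
print(time.tzname)                                   # ('UTC', 'UTC') after time.tzset()

for row in df.rdd.collect():
    print(row.Mindate)  # now printed in UTC, matching df.show()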
I've run into the exact same issue, where a timestamp would be converted whenever I collect a DataFrame from Spark to Python. I wrote the following simple test to confirm the behaviour:
import datetime
import pyspark.sql.functions as f

def test_timezone_string_to_timestamp_utc(spark):
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    df = spark.createDataFrame([("1970-01-01 10:00:00",)], "ts STRING").withColumn("ts", f.col("ts").cast("timestamp"))
    assert df.collect()[0]["ts"] == datetime.datetime(1970, 1, 1, 10)
This fails with the message datetime.datetime(1970, 1, 1, 11, 0) != datetime.datetime(1970, 1, 1, 10, 0). What happens is that the timestamp is converted from UTC to my system default (CET) when the DataFrame is collected, i.e. when Spark's TimestampType is turned into Python's datetime.datetime. I don't know why this happens, but it happens... I've tried looking in the source, as @Kashyap also pointed out, but I don't really see why this would happen.
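One way to picture the shift (a sketch of the mechanics, not lifted from the Spark source): a Spark TimestampType value is internally an offset from the Unix epoch in UTC, and PySpark materialises it into a naive datetime using the Python process's local timezone, roughly like datetime.datetime.fromtimestamp:
import datetime

# 1970-01-01 10:00:00 UTC, expressed as seconds since the epoch
epoch_seconds = 10 * 3600

print(datetime.datetime.fromtimestamp(epoch_seconds))     # local wall time, e.g. 1970-01-01 11:00:00 under CET
print(datetime.datetime.utcfromtimestamp(epoch_seconds))  # 1970-01-01 10:00:00, the UTC wall time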
The Python timezone defaults to the system default timezone and can be configured via the TZ environment variable; see the Python docs.
So the conversion happens whenever there is a difference between Python's timezone and Spark's timezone setting. We can check this as follows:
>>> spark.conf.get("spark.sql.session.timeZone")
'UTC'
>>> import time
>>> time.tzname
('CET', 'CEST')
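If changing the driver's process timezone is not an option, another approach (a sketch, not part of the original answer; df and Mindate are taken from the question) is to format the timestamps to strings on the Spark side before collecting, since strings pass through collect() without any timezone conversion:
import pyspark.sql.functions as f

# Render Mindate as a string in the session timezone (UTC here) on the JVM side.
df_str = df.withColumn("Mindate", f.date_format("Mindate", "yyyy-MM-dd HH:mm:ss"))
for row in df_str.collect():
    print(row.Mindate)  # UTC wall-clock strings, identical to what df.show() displays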