 

pyspark date_format() and hour() converting timestamp to localtime

EDIT: I am using pyspark 2.0.2 and can't use a higher version.

I have some source data with a timestamp field with a zero offset ('Z'), and I am simply trying to extract the date and hour from this field. However, Spark converts the timestamp to local time (EDT in my case) before extracting the date and hour. Stripping the 'T' and 'Z' from the timestamp field with a UDF and then applying the same functions works, but that seems like a silly way to go about it. Any thoughts?

from pyspark.sql import SparkSession
from pyspark.sql.functions import date_format, hour

spark = (
    SparkSession
    .builder
    .appName('junk')
    .getOrCreate()
    )
spark.sparkContext.setLogLevel('ERROR')

df = spark.createDataFrame(
    [(1, '2018-04-20T00:56:30.562Z'),
     (2, '2018-04-20T03:56:30.562Z'),
     (3, '2018-04-20T05:56:30.562Z')],
    ['id', 'ts']
    )
df = (
    df
    .withColumn(
        'event_dt',
        date_format(df.ts.cast('timestamp'), 'yyyy-MM-dd').cast('date')
        )
    .withColumn('event_hr', hour(df.ts))
    )
print(df.head(5))

Output as follows:

[
  Row(id=1, ts='2018-04-20T00:56:30.562Z', event_dt=datetime.date(2018, 4, 19), event_hr=20),
  Row(id=2, ts='2018-04-20T03:56:30.562Z', event_dt=datetime.date(2018, 4, 19), event_hr=23),
  Row(id=3, ts='2018-04-20T05:56:30.562Z', event_dt=datetime.date(2018, 4, 20), event_hr=1)
]

The following workaround works, but I am looking for something more straightforward, if possible:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

stripTz = udf(lambda x: x.replace('T', ' ').replace('Z', ''), StringType())
df = (
    df
    .withColumn('newts', stripTz(df.ts))
    )
df = (
    df
    .withColumn(
        'event_dt',
        date_format(df.newts.cast('timestamp'), 'yyyy-MM-dd').cast('date')
        )
    .withColumn('event_hr', hour(df.newts))
    .drop('newts')
    )

print(df.head(5))

The new output is as desired:

[
  Row(id=1, ts='2018-04-20T00:56:30.562Z', event_dt=datetime.date(2018, 4, 20), event_hr=0),
  Row(id=2, ts='2018-04-20T03:56:30.562Z', event_dt=datetime.date(2018, 4, 20), event_hr=3),
  Row(id=3, ts='2018-04-20T05:56:30.562Z', event_dt=datetime.date(2018, 4, 20), event_hr=5)
]
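
For reference, the same strip looks possible without a Python UDF by using the built-in regexp_replace (available in 2.0.x); a minimal sketch, assuming ts always uses the 'T' separator and a trailing 'Z':

from pyspark.sql.functions import regexp_replace, date_format, hour

# Same strip as the UDF above, but with built-in functions only:
# drop the trailing 'Z', then turn 'T' into a space.
clean = regexp_replace(regexp_replace(df.ts, 'Z$', ''), 'T', ' ')
df = (
    df
    .withColumn(
        'event_dt',
        date_format(clean.cast('timestamp'), 'yyyy-MM-dd').cast('date')
        )
    .withColumn('event_hr', hour(clean.cast('timestamp')))
    )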
asked by Gopala


1 Answer

What version of Spark are you using? In 2.2+ you can simply set the time zone for your SparkSession:

spark.conf.set("spark.sql.session.timeZone", "GMT")
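
For example, with the same DataFrame as in the question (a minimal sketch; 2.2+ assumed, since the session time zone config was added then):

spark.conf.set("spark.sql.session.timeZone", "GMT")

# With the session time zone set to GMT, date_format and hour render
# the parsed UTC instant in GMT instead of the JVM's local time zone,
# so no UDF or string stripping is needed.
df = (
    df
    .withColumn(
        'event_dt',
        date_format(df.ts.cast('timestamp'), 'yyyy-MM-dd').cast('date')
        )
    .withColumn('event_hr', hour(df.ts.cast('timestamp')))
    )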

Alternatively, convert the string to a timestamp explicitly:

df.select("id", "ts", pyspark.sql.functions.to_timestamp("ts").alias("timestamp"))

then set the session time zone to whatever you need before extracting the date and hour.
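
Putting that together (a minimal sketch; to_timestamp was also added in 2.2, so this again assumes 2.2+):

from pyspark.sql.functions import to_timestamp, date_format, hour

spark.conf.set("spark.sql.session.timeZone", "GMT")  # or any other zone

# Parse the ISO-8601 string into a proper timestamp column first,
# then extract date and hour in the configured session time zone.
df = df.select('id', 'ts', to_timestamp('ts').alias('timestamp'))
df = (
    df
    .withColumn(
        'event_dt',
        date_format(df.timestamp, 'yyyy-MM-dd').cast('date')
        )
    .withColumn('event_hr', hour(df.timestamp))
    )
print(df.head(5))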

answered by maverik