String to Date migration from Spark 2.0 to 3.0 gives Fail to recognize 'EEE MMM dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter

Tags:

I have a date string from a source in the format 'Fri May 24 00:00:00 BST 2019' that I would convert to a date and store in my dataframe as '2019-05-24' using code like my example which works for me under spark 2.0

from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
df2 = df.select('date_str', to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date'))
df2.show(1, False)

In my sandbox environment I've updated to spark 3.0 and now get the following error for the above code, is there a new method of doing this in 3.0 to convert my string to a date

: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'EEE MMM dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter.

You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0.

You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

966

asked Jun 26 '20 20:06

lakeuk

2 Answers

If you want to use the legacy format in a newer version of spark(>3), you need to set spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY") or spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY"), which will resolve the issue.

answered Oct 24 '22 08:10

Shivam Tripathi

Thanks for responses, excellent advice, for the moment I'll be going with the LEGACY setting. I have a workaround with Spark 3.0 by substringing out the EEE element but I've noticed a bug with how BST timezone converts incorrectly offseting by 10 hours while under LEGACY it correctly remains the same as I'm currently in BST zone. I can do something with this but will wait till the clocks change in the autumn to confirm.

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
                to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
                to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
               ).show(1, False)

df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate', 
                to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
                to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
               ).show(1, False)

spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate', 
                to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')          
               ).show(1, False)

df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate', 
                to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')           
               ).show(1, False)

+----------------------------+-------------------+-------------------+
|mydate                      |datetime           |LEGACYdatetime     |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-24 00:00:00|2019-05-24 00:00:00|
+----------------------------+-------------------+-------------------+

+----------------------------+-------------------+-------------------+
|mydate                      |datetime           |LEGACYdatetime     |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|2019-05-24 01:00:00|
+----------------------------+-------------------+-------------------+

+----------------------------+-------------------+
|mydate                      |datetime           |
+----------------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-23 14:00:00|
+----------------------------+-------------------+

+----------------------------+-------------------+
|mydate                      |datetime           |
+----------------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|
+----------------------------+-------------------+

answered Oct 24 '22 09:10

lakeuk

Related questions
                            
                                pyspark merge two rdd together
                            
                                How to make onehotencoder in Spark to work like onehotencoder in Pandas?
                            
                                How long does RDD remain in memory?
                            
                                Pyspark ML - How to save pipeline and RandomForestClassificationModel
                            
                                Efficient string suffix detection
                            
                                Spark / Scala: Passing RDD to Function
                            
                                Why do I have to explicitly tell Spark what to cache?
                            
                                How to apply a function to a column of a Spark DataFrame?
                            
                                How do I convert column of unix epoch to Date in Apache spark DataFrame using Java?
                            
                                Query in Spark SQL inside an array
                            
                                Spark list all cached RDD names and unpersist
                            
                                Request insufficient authentication scopes when running Spark-Job on dataproc
                            
                                Unresolved reference while trying to import col from pyspark.sql.functions in python 3.5
                            
                                IllegalArgumentException thrown when count and collect function in spark
                            
                                could not read data from json using pyspark
                            
                                How to add days (as values of a column) to date?
                            
                                No module named graphframes Jupyter Notebook
                            
                                How to change number of executors in local mode?
                            
                                partitionBy & overwrite strategy in an Azure DataLake using PySpark in Databricks
                            
                                How can I pass a list of columns to select in pyspark dataframe?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

String to Date migration from Spark 2.0 to 3.0 gives Fail to recognize 'EEE MMM dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter

Tags:

apache-spark

apache-spark-sql

pyspark

lakeuk

People also ask

2 Answers

Shivam Tripathi

lakeuk

Recent Activity

Donate For Us