INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER

Question

I am very new to pyspark and getting below error, even if drop all date related columns or selecting only one column. Date format stored in my data frame like " enter image description here ". Can anyone please suggest changes I could made in dataframe to resolve this/date formats supported by new parser. It's working if I set "spark.sql.legacy.timeParserPolicy" to "LEGACY"

[INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0: Caused by: DateTimeParseException: Text '1/1/2023 3:57:22 AM' could not be parsed at index 0 org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 15.0 failed 4 times, most recent failure: Lost task 4.3 in stage 15.0 (TID 355) (10.139.64.5 executor 0): org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0: Fail to parse '1/1/2023 3:57:22 AM' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.

Example:

#spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
from pyspark.sql.functions import *
from pyspark.sql import functions as F

emp = [(1, "AAA", "dept1", 1000, "12/22/2022  3:11:44 AM"),
(2, "BBB", "dept1", 1100, "12/22/2022  3:11:44 AM"),
(3, "CCC", "dept1", 3000, "12/22/2022  3:11:44 AM"),
(4, "DDD", "dept1", 1500, "12/22/2022  3:11:44 AM"),
(5, "EEE", "dept2", 8000, "12/22/2022  3:11:44 AM"),
(6, "FFF", "dept2", 7200, "12/22/2022  3:11:44 AM"),
(7, "GGG", "dept3", 7100, "12/22/2022  3:11:44 AM"),
(8, "HHH", "dept3", 3700, "12/22/2022  3:11:44 PM"),
(9, "III", "dept3", 4500, "12/22/2022  3:11:44 PM"),
(10, "JJJ", "dept5", 3400,"12/22/2022 3:11:44 PM")]
empdf = spark.createDataFrame(emp, ["id", "name", "dept", "salary", 
"date"])

#empdf.printSchema()
df = empdf.withColumn("date", F.to_timestamp(col("date"), 
"MM/dd/yyyy hh:mm:ss a"))
df.show(12,False)

Thanks a lot, in Advance

Shailendra Kumar · Accepted Answer

In general the new parser does not seem to like normal date patterns like MM or dd but prefers M or d.

Example 1:

# Does not work: 
df = df.withColumn("date", F.to_timestamp(col("date"), "MM/dd/yyyy  hh:mm:ss a"))

# Does work:
df = pdf.withColumn("date", F.to_timestamp(col("date"), "M/d/yyyy h:m:s a"))

Example 2

# Does not work: 
MM/dd/yyyy H:mm:ss a

# Does work: 
MM/d/yyyy H:mm:ss a

In addition the parser seems to be less robust against things that are not explicitly defined in the pattern, e.g. when parsing dates from timestamps.

Example 3

String example: 02.03.2004 10:35 `

# Does not work: 
df.withColumn(
  "parsed_date", 
  psf.to_date("unparsed_date", "dd.MM.yyyy"))

# Does work: 
df.withColumn(
  "parsed_date", 
  psf.to_date(psf.col("unparsed_date").substr(0, 10), "d.M.y"))

INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER

Tags:

apache-spark-sql

pyspark

databricks-sql

Shailendra Kumar

1 Answers

Shailendra Kumar

Recent Activity

Donate For Us

INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER

Tags:

apache-spark-sql

pyspark

databricks-sql

Shailendra Kumar

1 Answers

Shailendra Kumar

Related questions

Recent Activity

Donate For Us