Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER

I am very new to pyspark and getting below error, even if drop all date related columns or selecting only one column. Date format stored in my data frame like "enter image description here". Can anyone please suggest changes I could made in dataframe to resolve this/date formats supported by new parser. It's working if I set "spark.sql.legacy.timeParserPolicy" to "LEGACY"

[INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0: Caused by: DateTimeParseException: Text '1/1/2023 3:57:22 AM' could not be parsed at index 0 org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 15.0 failed 4 times, most recent failure: Lost task 4.3 in stage 15.0 (TID 355) (10.139.64.5 executor 0): org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0: Fail to parse '1/1/2023 3:57:22 AM' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.

Example:

#spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
from pyspark.sql.functions import *
from pyspark.sql import functions as F

emp = [(1, "AAA", "dept1", 1000, "12/22/2022  3:11:44 AM"),
(2, "BBB", "dept1", 1100, "12/22/2022  3:11:44 AM"),
(3, "CCC", "dept1", 3000, "12/22/2022  3:11:44 AM"),
(4, "DDD", "dept1", 1500, "12/22/2022  3:11:44 AM"),
(5, "EEE", "dept2", 8000, "12/22/2022  3:11:44 AM"),
(6, "FFF", "dept2", 7200, "12/22/2022  3:11:44 AM"),
(7, "GGG", "dept3", 7100, "12/22/2022  3:11:44 AM"),
(8, "HHH", "dept3", 3700, "12/22/2022  3:11:44 PM"),
(9, "III", "dept3", 4500, "12/22/2022  3:11:44 PM"),
(10, "JJJ", "dept5", 3400,"12/22/2022 3:11:44 PM")]
empdf = spark.createDataFrame(emp, ["id", "name", "dept", "salary", 
"date"])

#empdf.printSchema()
df = empdf.withColumn("date", F.to_timestamp(col("date"), 
"MM/dd/yyyy hh:mm:ss a"))
df.show(12,False)

Thanks a lot, in Advance

like image 385
Shailendra Kumar Avatar asked Nov 26 '25 03:11

Shailendra Kumar


1 Answers

In general the new parser does not seem to like normal date patterns like MM or dd but prefers M or d.

Example 1:

# Does not work: 
df = df.withColumn("date", F.to_timestamp(col("date"), "MM/dd/yyyy  hh:mm:ss a"))

# Does work:
df = pdf.withColumn("date", F.to_timestamp(col("date"), "M/d/yyyy h:m:s a"))

Example 2

# Does not work: 
MM/dd/yyyy H:mm:ss a

# Does work: 
MM/d/yyyy H:mm:ss a

In addition the parser seems to be less robust against things that are not explicitly defined in the pattern, e.g. when parsing dates from timestamps.

Example 3

String example: 02.03.2004 10:35 `

# Does not work: 
df.withColumn(
  "parsed_date", 
  psf.to_date("unparsed_date", "dd.MM.yyyy"))

# Does work: 
df.withColumn(
  "parsed_date", 
  psf.to_date(psf.col("unparsed_date").substr(0, 10), "d.M.y"))
like image 59
Shailendra Kumar Avatar answered Nov 28 '25 02:11

Shailendra Kumar



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!