Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark 3.0 is much slower to read json files than Spark 2.4

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0

scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349

Here are the details:

enter image description here

like image 290
smishra Avatar asked Jun 27 '20 23:06

smishra


1 Answers

As it turns out default behavior of Spark 3.0 has changed - it tries to infer timestamp unless schema is specified and that results into huge amount of text scan. I tried to load the data with inferTimestamp=false time did come close to that of Spark 2.4 but Spark 2.4 still beats Spark 3 by ~3+ sec (may be in acceptable range but question is why?). I have no idea why this behavior was changed but its should have been notified in BOLD letters.

Spark 2.4

spark.time(spark.read.option("inferTimestamp","false").json("/data/20200528/").count)
Time taken: 29706 ms
res0: Long = 2605349



spark.time(spark.read.option("inferTimestamp","false").option("prefersDecimal","false").json("/data/20200528/").count)
Time taken: 31431 ms
res0: Long = 2605349

Spark 3.0

spark.time(spark.read.option("inferTimestamp","false").json("/data/20200528/").count)
Time taken: 32826 ms
res0: Long = 2605349
 
spark.time(spark.read.option("inferTimestamp","false").option("prefersDecimal","false").json("/data/20200528/").count)
Time taken: 34011 ms
res0: Long = 2605349

Note:

  • Make sure you never turn on prefersDecimal to true even when inferTimestamp is false, it again takes huge amount of time.
  • Spark 3.0 + JDK 11 is slower than Spark 3.0 + JDK 8 by almost 6 sec.
like image 131
smishra Avatar answered Sep 20 '22 10:09

smishra