I'm wondering whether there is a difference in read performance between these two commands:
df.write.format('parquet').partitionBy(xx).save('/.../xx.parquet')
df.write.format('parquet').partitionBy(xx).saveAsTable('...')
I understand that for bucketing the question doesn't arise, since bucketing is only available with managed tables (saveAsTable()); however, I'm a bit confused about whether one of the two methods should be preferred for partitioning.
I've tried to find an answer experimentally on a small DataFrame; here are the results:
ENV = Databricks Community edition
[Attached to cluster: test, 15.25 GB | 2 Cores | DBR 7.4 | Spark 3.0.1 | Scala 2.12]
sqlContext.setConf( "spark.sql.shuffle.partitions", 2)
spark.conf.set("spark.sql.adaptive.enabled","true")
df.count() = 693243
RESULTS:
As expected, writing with .saveAsTable() takes a bit longer, because it has to run a dedicated "CreateDataSourceTableAsSelectCommand" to actually create the table. However, it is interesting that reading is nearly 10x faster with .saveAsTable() in this simple example. I'd be very interested to see the comparison at a much larger scale if someone has the ability to run it, and to understand what happens under the hood.
