
Spark 3.0 - Reading performance when saved using .save() or .saveAsTable()

I'm wondering if there are differences in performance (when reading) between those two commands?:

df.write.format('parquet').partitionBy(xx).save('/.../xx.parquet')
df.write.format('parquet').partitionBy(xx).saveAsTable('...')

I understand that for bucketing the question doesn't arise, since bucketing is only supported for managed tables (saveAsTable()); however, I'm a bit confused about partitioning and whether one of the two methods should be preferred.

Asked Oct 31 '25 by ZaraThoustra

1 Answer

I've tried to find an answer experimentally on a small dataframe; here are the results:

ENV = Databricks Community edition 
      [Attached to cluster: test, 15.25 GB | 2 Cores | DBR 7.4 | Spark 3.0.1 | Scala 2.12]

sqlContext.setConf("spark.sql.shuffle.partitions", 2)
spark.conf.set("spark.sql.adaptive.enabled","true")

df.count() = 693243
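The answer doesn't say how the read times were measured; one way to reproduce the comparison is a small wall-clock harness along these lines (time_read and the example lambdas are my own scaffolding, not part of the original experiment):

```python
import time

def time_read(read_fn, n_runs=3):
    """Call read_fn n_runs times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(n_runs):
        start = time.perf_counter()
        read_fn()  # e.g. lambda: spark.read.parquet(path).count()
        best = min(best, time.perf_counter() - start)
    return best
```

Usage would be something like `time_read(lambda: spark.table("xx_table").count())` versus `time_read(lambda: spark.read.parquet("/.../xx.parquet").count())`; taking the best of several runs reduces noise from JVM warm-up and caching.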

RESULTS :

As expected, writing with .saveAsTable() takes a bit longer, because it has to execute a dedicated "CreateDataSourceTableAsSelectCommand" to actually create the table. However, it is interesting to observe the difference when reading, in favor of .saveAsTable() by nearly a factor of 10 in this simple example. I'd be very interested to compare the results at a much larger scale if someone ever has the ability to do so, and to understand what happens under the hood.

(screenshot of the timing results not reproduced)

Answered Nov 02 '25 by ZaraThoustra


