I'm wondering whether there is a difference in read performance between these two commands:
df.write.format('parquet').partitionBy(xx).save('/.../xx.parquet')
df.write.format('parquet').partitionBy(xx).saveAsTable('...')
I understand that for bucketing the question doesn't arise, since bucketing is only available with managed tables (saveAsTable()); however, I'm a bit confused about whether one of the two methods should be preferred for partitioning.
I've tried to find an answer experimentally on a small DataFrame; here are the results:
ENV = Databricks Community edition
[Attached to cluster: test, 15.25 GB | 2 Cores | DBR 7.4 | Spark 3.0.1 | Scala 2.12]
sqlContext.setConf( "spark.sql.shuffle.partitions", 2)
spark.conf.set("spark.sql.adaptive.enabled","true")
df.count() = 693243
RESULTS:
As expected, writing with .saveAsTable() takes a bit longer, because it has to run a dedicated "CreateDataSourceTableAsSelectCommand" to actually create the table. However, it is interesting that reading is nearly 10x faster with .saveAsTable() in this simple example. I'd be very interested to see the comparison at a much larger scale if someone has the ability to run it, and to understand what happens under the hood.
