Save DataFrame to Table - performance in Pyspark

I know there are two ways to save a DF to a table in Pyspark:

1) df.write.saveAsTable("MyDatabase.MyTable")

2) df.createOrReplaceTempView("TempView")
   spark.sql("CREATE TABLE MyDatabase.MyTable as select * from TempView")

Is there any difference in performance between a "CREATE TABLE AS" statement and "saveAsTable" when running on a large distributed dataset?

Asked by Nabil RIFKI

1 Answer

createOrReplaceTempView creates (or replaces, if a view with that name already exists) a lazily evaluated "view" that can be used as a table in Spark SQL. It is not materialized until you call an action (such as count), and it is not persisted to memory unless you call cache on the dataset that underpins the view. As the name suggests, this is just a temporary view; it is lost when your application/session ends.
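For illustration, here is a minimal sketch of that temp-view lifecycle; the SparkSession setup and the spark.range data are hypothetical stand-ins for your own pipeline:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tempview-demo").getOrCreate()
    df = spark.range(1000)  # stand-in for your real DataFrame

    # Registering the view is essentially free: nothing is computed yet.
    df.createOrReplaceTempView("TempView")

    # The plan behind the view is evaluated only when an action runs.
    spark.sql("SELECT COUNT(*) FROM TempView").show()

    # Once the application/session ends, TempView is gone.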

saveAsTable, on the other hand, writes the data to an external store such as HDFS, S3, or ADLS. This is permanent storage: it outlives the SparkSession or Spark application and is available for use later.
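A minimal sketch of the persistent path, continuing the example above (the overwrite mode and parquet format are illustrative choices, not requirements):

    # Persist the DataFrame as a managed table in the metastore.
    (df.write
        .mode("overwrite")   # replace MyDatabase.MyTable if it already exists
        .format("parquet")
        .saveAsTable("MyDatabase.MyTable"))

    # The table outlives this session; a later job can read it back:
    restored = spark.table("MyDatabase.MyTable")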

So the main difference is in the lifetime of the dataset rather than in performance. Obviously, within the same job, working with cached data is faster, as the sketch below illustrates.
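Continuing the sketch, caching materializes the view in memory so repeated queries in the same job avoid re-scanning the source:

    # Cache the data behind the view (spark.catalog.cacheTable is one way;
    # calling cache() on the underlying DataFrame is another).
    spark.catalog.cacheTable("TempView")

    spark.sql("SELECT COUNT(*) FROM TempView").show()  # first action fills the cache
    spark.sql("SELECT AVG(id) FROM TempView").show()   # served from the cached data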

Answered by Aravind Yarram

