I know there are two ways to save a DataFrame to a table in PySpark:
1) df.write.saveAsTable("MyDatabase.MyTable")
2) df.createOrReplaceTempView("TempView")
spark.sql("CREATE TABLE MyDatabase.MyTable as select * from TempView")
Is there any difference in performance between a "CREATE TABLE AS" statement and "saveAsTable" when running on a large distributed dataset?
createOrReplaceTempView creates (or replaces, if a view with that name already exists) a lazily evaluated "view" that can be used as a table in Spark SQL. It is not materialized until you call an action (like count), nor is it persisted to memory unless you call cache on the dataset that underpins the view. As the name suggests, this is just a temporary view; it is lost once your application/session ends.
saveAsTable, on the other hand, writes the data out to an external store such as HDFS, S3, or ADLS and registers the table in the metastore. This is permanent storage: it outlives the scope of the SparkSession or Spark application and is available for later use.
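A minimal sketch of a persistent write; the database/table names, mode, and format here are illustrative assumptions, not requirements:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-as-table-demo").getOrCreate()

# Hypothetical DataFrame used only for illustration.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Assumption: the target database may not exist yet.
spark.sql("CREATE DATABASE IF NOT EXISTS MyDatabase")

# Writes the files out (e.g. to the warehouse on HDFS/S3/ADLS) and registers
# MyDatabase.MyTable in the metastore, so it survives this application.
(df.write
   .mode("overwrite")        # assumption: replace any existing table
   .format("parquet")        # assumption: Parquet as the file format
   .saveAsTable("MyDatabase.MyTable"))
```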
So the main difference is the lifetime of the dataset rather than performance. Obviously, within the same job, working with cached data is faster.
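For example, if the same data feeds several outputs within one job, caching it first avoids recomputation. This sketch assumes the same hypothetical spark and df objects from the snippets above, and the second table name is made up:

```python
df.cache()  # keep the data in memory for reuse within this job

# Both writes below reuse the cached data instead of recomputing df.
df.write.mode("overwrite").saveAsTable("MyDatabase.MyTable")

df.createOrReplaceTempView("TempView")
spark.sql(
    "CREATE TABLE MyDatabase.MyTableCtas AS SELECT * FROM TempView"
)
```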