I know there are two ways to save a DataFrame to a table in PySpark:
1) df.write.saveAsTable("MyDatabase.MyTable")
2) df.createOrReplaceTempView("TempView")
spark.sql("CREATE TABLE MyDatabase.MyTable as select * from TempView")
Is there any difference in performance between a "CREATE TABLE AS" statement and "saveAsTable" when running on a large distributed dataset?
createOrReplaceTempView creates (or replaces, if a view with that name already exists) a lazily evaluated "view" that can be used as a table in Spark SQL. It is not materialized until you call an action (like count), nor is it persisted to memory unless you call cache on the dataset that underpins the view. As the name suggests, this is just a temporary view; it is lost once your application/session ends.
saveAsTable, on the other hand, writes the data out to an external store such as HDFS, S3, or ADLS and registers the table in the metastore. This is permanent storage: it outlives the scope of the SparkSession or Spark application and is available for later use.
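A minimal sketch of a persistent write; the database/table names, mode, and format here are illustrative assumptions, not requirements:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-as-table-demo").getOrCreate()

# Hypothetical DataFrame used only for illustration.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Assumption: the target database may not exist yet.
spark.sql("CREATE DATABASE IF NOT EXISTS MyDatabase")

# Writes the files out (e.g. to the warehouse on HDFS/S3/ADLS) and registers
# MyDatabase.MyTable in the metastore, so it survives this application.
(df.write
   .mode("overwrite")        # assumption: replace any existing table
   .format("parquet")        # assumption: Parquet as the file format
   .saveAsTable("MyDatabase.MyTable"))
```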
So the main difference is the lifetime of the dataset rather than performance. Obviously, within the same job, working with cached data is faster.
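For example, if the same data feeds several outputs within one job, caching it first avoids recomputation. This sketch assumes the same hypothetical spark and df objects from the snippets above, and the second table name is made up:

```python
df.cache()  # keep the data in memory for reuse within this job

# Both writes below reuse the cached data instead of recomputing df.
df.write.mode("overwrite").saveAsTable("MyDatabase.MyTable")

df.createOrReplaceTempView("TempView")
spark.sql(
    "CREATE TABLE MyDatabase.MyTableCtas AS SELECT * FROM TempView"
)
```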