I'm trying to write a <code>DataFrame</code> into <code>Hive</code> table (on <code>S3</code>) in <code>Overwrite</code> mode (necessary for my application) and need to decide between two methods of DataFrameWriter (Spark / Scala). From what I can read in the documentation, <code>df.write.saveAsTable</code> differs from <code>df.write.insertInto</code> in the following respects: <ul> <li> <code>saveAsTable</code> uses column-name based resolution while <code>insertInto</code> uses position-based resolution </li> <li>In Append mode, <code>saveAsTable</code> pays more attention to underlying schema of the existing table to make certain resolutions</li> </ul> Overall, it gives me the impression that <code>saveAsTable</code> is just a smarter version of <code>insertInto</code>. Alternatively, depending on use-case, one might prefer <code>insertInto</code> But do each of these methods come with some caveats of their own like performance penalty in case of <code>saveAsTable</code> (since it packs in more features)? Are there any other differences in their behaviours apart from what is told (not very clearly) in the docs? <hr> EDIT-1 Documentation says this regarding <code>insertInto</code> <blockquote> Inserts the content of the DataFrame to the specified table </blockquote> and this for <code>saveAsTable</code> <blockquote> In the case the table already exists, behavior of this function depends on the save mode, specified by the mode function </blockquote> Now I can list my doubts <ul> <li>Does <code>insertInto</code> always expect the table to exist?</li> <li>Do <code>SaveMode</code>s have any impact on <code>insertInto</code>?</li> <li>If above answer is yes, then <ul> <li>what's the differences between <code>saveAsTable</code> with <code>SaveMode.Append</code> and <code>insertInto</code> given that table already exists?</li> <li>does <code>insertInto</code> with <code>SaveMode.Overwrite</code> make any sense?</li> </ul> </li> </ul>

I want to point out a major difference between <code>SaveAsTable</code> and <code>insertInto</code> in SPARK. In partitioned table <code>overwrite</code> SaveMode work differently in case of <code>SaveAsTable</code> and <code>insertInto</code>. Consider below example.Where I am creating partitioned table using <code>SaveAsTable</code> method. <pre class="prettyprint"><code>hive> CREATE TABLE `db.companies_table`(`company` string) PARTITIONED BY ( `id` date); OK Time taken: 0.094 seconds </code></pre> <pre class="prettyprint"><code>import org.apache.spark.sql._* import spark.implicits._ import org.apache.spark.sql._ scala>val targetTable = "db.companies_table" scala>val companiesDF = Seq(("2020-01-01", "Company1"), ("2020-01-02", "Company2")).toDF("id", "company") scala>companiesDF.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable(targetTable) scala> spark.sql("select * from db.companies_table").show() +--------+----------+ | company| id| +--------+----------+ |Company1|2020-01-01| |Company2|2020-01-02| +--------+----------+ </code></pre> Now I am adding 2 new rows with 2 new partition values. <pre class="prettyprint"><code>scala> val companiesDF = Seq(("2020-01-03", "Company1"), ("2020-01-04", "Company2")).toDF("id", "company") scala> companiesDF.write.mode(SaveMode.Append).partitionBy("id").saveAsTable(targetTable) scala>spark.sql("select * from db.companies_table").show() +--------+----------+ | company| id| +--------+----------+ |Company1|2020-01-01| |Company2|2020-01-02| |Company1|2020-01-03| |Company2|2020-01-04| +--------+----------+ </code></pre> As you can see 2 new rows are added to the table. Now let`s say i want to <code>Overwrite</code> partition 2020-01-02 data. <pre class="prettyprint"><code>scala> val companiesDF = Seq(("2020-01-02", "Company5")).toDF("id", "company") scala>companiesDF.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable(targetTable) </code></pre> As per our logic only partitions 2020-01-02 should be overwritten but the case with <code>SaveAsTable</code> is different.It will overwrite the enter table as you can see below. <pre class="prettyprint"><code>scala> spark.sql("select * from db.companies_table").show() +-------+----------+ |company| id| +-------+----------+ |Company5|2020-01-02| +-------+----------+ </code></pre> So if we want to overwrite only certain partitions in the table using <code>SaveAsTable</code> its not possible. Refer this Link for more details. https://towardsdatascience.com/understanding-the-spark-insertinto-function-1870175c3ee9

What are the differences between saveAsTable and insertInto in different SaveMode(s)?

Tags:

apache-spark

I'm trying to write a DataFrame into Hive table (on S3) in Overwrite mode (necessary for my application) and need to decide between two methods of DataFrameWriter (Spark / Scala). From what I can read in the documentation, df.write.saveAsTable differs from df.write.insertInto in the following respects:

saveAsTable uses column-name based resolution while insertInto uses position-based resolution
In Append mode, saveAsTable pays more attention to underlying schema of the existing table to make certain resolutions

Overall, it gives me the impression that saveAsTable is just a smarter version of insertInto. Alternatively, depending on use-case, one might prefer insertInto

But do each of these methods come with some caveats of their own like performance penalty in case of saveAsTable (since it packs in more features)? Are there any other differences in their behaviours apart from what is told (not very clearly) in the docs?

EDIT-1

Documentation says this regarding insertInto

Inserts the content of the DataFrame to the specified table

and this for saveAsTable

In the case the table already exists, behavior of this function depends on the save mode, specified by the mode function

Now I can list my doubts

Does insertInto always expect the table to exist?
Do SaveModes have any impact on insertInto?
If above answer is yes, then
- what's the differences between saveAsTable with SaveMode.Append and insertInto given that table already exists?
- does insertInto with SaveMode.Overwrite make any sense?

438

asked Dec 16 '17 10:12

y2k-shubham

Video Answer

2 Answers

DISCLAIMER I've been exploring insertInto for some time and although I'm far from an expert in this area I'm sharing the findings for greater good.

Does insertInto always expect the table to exist?

Yes (per the table name and the database).

Moreover not all tables can be inserted into, i.e. a (permanent) table, a temporary view or a temporary global view are fine, but not:

a bucketed table
an RDD-based table

Do SaveModes have any impact on insertInto?

(That's recently been my question, too!)

Yes, but only SaveMode.Overwrite. After you think about insertInto the other 3 save modes don't make much sense (as it simply inserts a dataset).

what's the differences between saveAsTable with SaveMode.Append and insertInto given that table already exists?

That's a very good question! I'd say none, but let's see by just one example (hoping that proves something).

scala> spark.version res13: String = 2.4.0-SNAPSHOT  sql("create table my_table (id long)") scala> spark.range(3).write.mode("append").saveAsTable("my_table") org.apache.spark.sql.AnalysisException: The format of the existing table default.my_table is `HiveFileFormat`. It doesn't match the specified format `ParquetFileFormat`.;   at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:117)   at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:76) ... scala> spark.range(3).write.insertInto("my_table") scala> spark.table("my_table").show +---+ | id| +---+ |  2| |  0| |  1| +---+

does insertInto with SaveMode.Overwrite make any sense?

I think so given it pays so much attention to SaveMode.Overwrite. It simply re-creates the target table.

spark.range(3).write.mode("overwrite").insertInto("my_table") scala> spark.table("my_table").show +---+ | id| +---+ |  1| |  0| |  2| +---+  Seq(100, 200, 300).toDF.write.mode("overwrite").insertInto("my_table") scala> spark.table("my_table").show +---+ | id| +---+ |200| |100| |300| +---+

answered Sep 21 '22 15:09

Jacek Laskowski

I want to point out a major difference between SaveAsTable and insertInto in SPARK.

In partitioned table overwrite SaveMode work differently in case of SaveAsTable and insertInto.

Consider below example.Where I am creating partitioned table using SaveAsTable method.

hive> CREATE TABLE `db.companies_table`(`company` string) PARTITIONED BY ( `id` date); OK Time taken: 0.094 seconds

import org.apache.spark.sql._* import spark.implicits._ import org.apache.spark.sql._  scala>val targetTable = "db.companies_table"  scala>val companiesDF = Seq(("2020-01-01", "Company1"), ("2020-01-02", "Company2")).toDF("id", "company")  scala>companiesDF.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable(targetTable)  scala> spark.sql("select * from db.companies_table").show() +--------+----------+ | company|        id| +--------+----------+ |Company1|2020-01-01| |Company2|2020-01-02| +--------+----------+

Now I am adding 2 new rows with 2 new partition values.

scala> val companiesDF = Seq(("2020-01-03", "Company1"), ("2020-01-04", "Company2")).toDF("id", "company")  scala> companiesDF.write.mode(SaveMode.Append).partitionBy("id").saveAsTable(targetTable)  scala>spark.sql("select * from db.companies_table").show()  +--------+----------+                                                            | company|        id| +--------+----------+ |Company1|2020-01-01| |Company2|2020-01-02| |Company1|2020-01-03| |Company2|2020-01-04| +--------+----------+

As you can see 2 new rows are added to the table.

Now let`s say i want to Overwrite partition 2020-01-02 data.

scala> val companiesDF = Seq(("2020-01-02", "Company5")).toDF("id", "company")  scala>companiesDF.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable(targetTable)

As per our logic only partitions 2020-01-02 should be overwritten but the case with SaveAsTable is different.It will overwrite the enter table as you can see below.

scala> spark.sql("select * from db.companies_table").show() +-------+----------+ |company|        id| +-------+----------+ |Company5|2020-01-02| +-------+----------+

So if we want to overwrite only certain partitions in the table using SaveAsTable its not possible.

Refer this Link for more details. https://towardsdatascience.com/understanding-the-spark-insertinto-function-1870175c3ee9

answered Sep 17 '22 15:09

Sandeep Khot

Related questions
                            
                                How to find the master URL for an existing spark cluster
                            
                                What's the most efficient way to filter a DataFrame
                            
                                Warnings while building Scala/Spark project with SBT
                            
                                Spark DataFrame: does groupBy after orderBy maintain that order?
                            
                                Difference between createOrReplaceTempView and registerTempTable
                            
                                Adding a group count column to a PySpark dataframe
                            
                                how to get max(date) from given set of data grouped by some fields using pyspark?
                            
                                Google Dataflow vs Apache Spark
                            
                                Building a row from a dict in pySpark
                            
                                Column name with dot spark
                            
                                How to uncache RDD?
                            
                                Spark Equivalent of IF Then ELSE
                            
                                apache spark - check if file exists
                            
                                Would Spark unpersist the RDD itself when it realizes it won't be used anymore?
                            
                                Debugging "Managed memory leak detected" in Spark 1.6.0
                            
                                How to check status of Spark applications from the command line?
                            
                                Spark 2.0 Dataset vs DataFrame
                            
                                Methods for writing Parquet files using Python?
                            
                                Extremely slow S3 write times from EMR/ Spark
                            
                                The value of "spark.yarn.executor.memoryOverhead" setting?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With