I'm trying to write a DataFrame into a Hive table (on S3) in `Overwrite` mode (necessary for my application) and need to decide between two methods of DataFrameWriter (Spark / Scala). From what I can read in the documentation, `df.write.saveAsTable` differs from `df.write.insertInto` in the following respects:
- `saveAsTable` uses column-name based resolution while `insertInto` uses position-based resolution (see the sketch after this list)
- `saveAsTable` pays more attention to the underlying schema of the existing table to make certain resolutions

Overall, it gives me the impression that `saveAsTable` is just a smarter version of `insertInto`. Alternatively, depending on the use case, one might prefer `insertInto`. But does each of these methods come with caveats of its own, like a performance penalty in the case of `saveAsTable` (since it packs in more features)? Are there any other differences in their behaviours apart from what is told (not very clearly) in the docs?
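To make the resolution difference concrete, here is a minimal sketch; the table `target` and its columns are hypothetical, not from the docs:

```scala
import spark.implicits._

// Hypothetical table, for illustration only:
spark.sql("CREATE TABLE target (a INT, b STRING) USING parquet")

val df = Seq(("x", 1)).toDF("b", "a")   // columns deliberately in (b, a) order

// saveAsTable resolves by column NAME: "x" goes to b, 1 goes to a.
df.write.mode("append").saveAsTable("target")

// insertInto resolves by POSITION: the first DataFrame column is aimed at the
// table's first column (a), so values can silently land in the wrong columns
// (after implicit casts) or fail to cast at all.
df.write.insertInto("target")
```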
EDIT-1

Documentation says this regarding `insertInto`:

> Inserts the content of the DataFrame to the specified table

and this for `saveAsTable`:

> In the case the table already exists, behavior of this function depends on the save mode, specified by the mode function

Now I can list my doubts:

1. Does `insertInto` always expect the table to exist?
2. Do `SaveMode`s have any impact on `insertInto`?
3. What's the difference between `saveAsTable` with `SaveMode.Append` and `insertInto`, given that the table already exists?
4. Does `insertInto` with `SaveMode.Overwrite` make any sense?
make any sense?saveAsTable("mytable") , the table is actually written to storage (HDFS/ S3). It is a Spark action. On the other hand: df. createOrReplaceTempView("my_temp_table") is a transformation. It is just an identifier to be used for the DAG of df .
saveAsTable("t") . When the table is dropped, the custom table path will not be removed and the table data is still there. If no custom table path is specified, Spark will write data to a default table path under the warehouse directory. When the table is dropped, the default table path will be removed too.
And on `createOrReplaceTempView`:

> Creates or replaces a local temporary view with this DataFrame. The lifetime of this temporary table is tied to the SparkSession that was used to create this DataFrame. New in version 2.0.
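A minimal sketch of that action-vs-transformation distinction (the names mirror the paragraph above):

```scala
val df = spark.range(1000000).selectExpr("id", "id % 10 AS bucket")

// Transformation: only registers a name for df's DAG; no Spark job runs yet.
df.createOrReplaceTempView("my_temp_table")

// Action: computes df and writes files under the warehouse (HDFS/S3).
df.write.saveAsTable("mytable")
```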
DISCLAIMER: I've been exploring `insertInto` for some time and, although I'm far from an expert in this area, I'm sharing the findings for the greater good.
> Does `insertInto` always expect the table to exist?
Yes (per the table name and the database).
Moreover, not all tables can be inserted into: a (permanent) table, a temporary view or a global temporary view are fine, but not:

- a bucketed table
- an RDD-based table
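A quick sketch of the first point; the table name is made up, and the exact exception message varies by Spark version:

```scala
// insertInto never creates the table, so this fails at analysis time,
// e.g. with org.apache.spark.sql.AnalysisException: Table not found: ...
spark.range(3).write.insertInto("table_that_does_not_exist")
```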
> Do `SaveMode`s have any impact on `insertInto`?
(That's recently been my question, too!)
Yes, but only `SaveMode.Overwrite`. Once you think about what `insertInto` does, the other three save modes don't make much sense (as it simply inserts a dataset).
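In other words, a sketch (using the hypothetical `my_table` from the next example; only `overwrite` changes the behaviour):

```scala
spark.range(3).write.insertInto("my_table")                   // plain insert
spark.range(3).write.mode("append").insertInto("my_table")    // same: inserts 3 more rows
spark.range(3).write.mode("overwrite").insertInto("my_table") // replaces the contents
```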
> What's the difference between `saveAsTable` with `SaveMode.Append` and `insertInto`, given that the table already exists?
That's a very good question! I'd say none, but let's check with just one example (hoping that proves something).
```scala
scala> spark.version
res13: String = 2.4.0-SNAPSHOT

sql("create table my_table (id long)")

scala> spark.range(3).write.mode("append").saveAsTable("my_table")
org.apache.spark.sql.AnalysisException: The format of the existing table default.my_table is `HiveFileFormat`. It doesn't match the specified format `ParquetFileFormat`.;
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:117)
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:76)
  ...

scala> spark.range(3).write.insertInto("my_table")

scala> spark.table("my_table").show
+---+
| id|
+---+
|  2|
|  0|
|  1|
+---+
```
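So the observed difference: with `SaveMode.Append`, `saveAsTable` additionally validates that the writer's format matches the existing table's format (`HiveFileFormat` vs `ParquetFileFormat` above), whereas `insertInto` just inserts. If the format check is the only obstacle, one possible workaround (assuming Spark 2.2+, where writing with `format("hive")` is supported) is:

```scala
// Sketch: target the Hive format explicitly so the format check passes.
spark.range(3).write.mode("append").format("hive").saveAsTable("my_table")
```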
> Does `insertInto` with `SaveMode.Overwrite` make any sense?
I think so, given that it pays so much attention to `SaveMode.Overwrite`: it simply re-creates the target table.
```scala
spark.range(3).write.mode("overwrite").insertInto("my_table")

scala> spark.table("my_table").show
+---+
| id|
+---+
|  1|
|  0|
|  2|
+---+

Seq(100, 200, 300).toDF.write.mode("overwrite").insertInto("my_table")

scala> spark.table("my_table").show
+---+
| id|
+---+
|200|
|100|
|300|
+---+
```
I want to point out a major difference between `saveAsTable` and `insertInto` in Spark: for a partitioned table, the `Overwrite` SaveMode works differently with `saveAsTable` than with `insertInto`.
Consider the example below, where I create a partitioned table and write to it with the `saveAsTable` method.
```
hive> CREATE TABLE `db.companies_table`(`company` string) PARTITIONED BY (`id` date);
OK
Time taken: 0.094 seconds
```
```scala
import org.apache.spark.sql._
import spark.implicits._

scala> val targetTable = "db.companies_table"
scala> val companiesDF = Seq(("2020-01-01", "Company1"), ("2020-01-02", "Company2")).toDF("id", "company")
scala> companiesDF.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable(targetTable)

scala> spark.sql("select * from db.companies_table").show()
+--------+----------+
| company|        id|
+--------+----------+
|Company1|2020-01-01|
|Company2|2020-01-02|
+--------+----------+
```
Now I am adding 2 new rows with 2 new partition values.
```scala
scala> val companiesDF = Seq(("2020-01-03", "Company1"), ("2020-01-04", "Company2")).toDF("id", "company")
scala> companiesDF.write.mode(SaveMode.Append).partitionBy("id").saveAsTable(targetTable)

scala> spark.sql("select * from db.companies_table").show()
+--------+----------+
| company|        id|
+--------+----------+
|Company1|2020-01-01|
|Company2|2020-01-02|
|Company1|2020-01-03|
|Company2|2020-01-04|
+--------+----------+
```
As you can see, 2 new rows are added to the table.
Now let's say I want to overwrite the data of partition 2020-01-02.
```scala
scala> val companiesDF = Seq(("2020-01-02", "Company5")).toDF("id", "company")
scala> companiesDF.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable(targetTable)
```
As per our logic, only partition 2020-01-02 should be overwritten, but the case with `saveAsTable` is different: it overwrites the entire table, as you can see below.
```scala
scala> spark.sql("select * from db.companies_table").show()
+--------+----------+
| company|        id|
+--------+----------+
|Company5|2020-01-02|
+--------+----------+
```
So overwriting only certain partitions of the table is not possible using `saveAsTable`.
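By contrast, `insertInto` can overwrite just the matching partitions. A sketch, assuming Spark 2.3+ (where the `spark.sql.sources.partitionOverwriteMode` setting exists) and the table from the example above; remember that `insertInto` is position-based, so the partition column must be selected last:

```scala
// Only overwrite the partitions present in the DataFrame (Spark 2.3+).
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

val companiesDF = Seq(("2020-01-02", "Company5")).toDF("id", "company")

// insertInto resolves columns by position; the table layout is (company, id)
// with id as the partition column, so select in that order.
companiesDF.select("company", "id")
  .write
  .mode(SaveMode.Overwrite)
  .insertInto("db.companies_table")
```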
Refer to this link for more details: https://towardsdatascience.com/understanding-the-spark-insertinto-function-1870175c3ee9