 

Apache Spark SQL table overwrite issue

I am using the code below to create a table from a DataFrame in Databricks, and I run into an error.

df.write.saveAsTable("newtable")

This works fine the very first time, but to make it re-runnable I rewrote it as follows:

df.write.mode(SaveMode.Overwrite).saveAsTable("newtable")

I get the following error.

Error Message:

org.apache.spark.sql.AnalysisException: Can not create the managed table newtable. The associated location dbfs:/user/hive/warehouse/newtable already exists
asked Sep 10 '20 by paone

People also ask

How does Spark overwrite mode work?

Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.

What is SQL overwrite?

Description. The INSERT OVERWRITE statement overwrites the existing data in the table using the new values. The inserted rows can be specified by value expressions or result from a query.

How do I clear my Spark cache?

You can clear all cached tables and DataFrames with spark.catalog.clearCache(), or release a single DataFrame with df.unpersist().

What is createOrReplaceTempView?

createOrReplaceTempView is used when you want to keep the table around for a specific Spark session. It creates (or replaces, if a view with that name already exists) a lazily evaluated view that you can then use like a Hive table in Spark SQL.

What happens when you overwrite a table in SQL?

By default, overwrite mode first drops the table if it already exists in the database, so use this option with care to avoid unexpected data loss. When overwriting over JDBC, if you do not set the truncate option, the table is dropped and recreated and its indexes are lost; for example, a columnstore table would come back as a heap.

How to fix create table in overwrite mode fails when interrupted?

Set the legacy config at the cluster level in the Spark configuration, or manually clean up the data directory named in the error message with "dbutils.fs.rm". Please refer to the documentation that addresses this issue: Create table in overwrite mode fails when interrupted

Which version of Apache Spark is supported by SQL Server?

Supported features:

    Component                              Versions Supported
    Apache Spark                           2.4.x, 3.0.x
    Scala                                  2.11, 2.12
    Microsoft JDBC Driver for SQL Server   8.4
    Microsoft SQL Server                   SQL Server 2008 or later
    (1 more row ...)

How fast is the Apache Spark connector for SQL Server?

Other bulk copy options can be set as options on the DataFrame and are passed to the bulkcopy APIs on write. The Apache Spark Connector for SQL Server and Azure SQL is up to 15x faster than the generic JDBC connector for writing to SQL Server. Performance characteristics vary with the type and volume of data and the options used, and may show run-to-run variation.


2 Answers

The SQL config 'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation' was removed in version 3.0.0. It was removed to prevent users from losing data when the non-default value was set.

answered Oct 19 '22 by hopefulnick


  • What are the differences between saveAsTable and insertInto in different SaveMode(s)?

Run the following command to fix the issue:

     dbutils.fs.rm("dbfs:/user/hive/warehouse/newtable/", true)

Or set the flag in the cluster's Spark configuration:

spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation = true

or set it programmatically:

spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation", "true")

Note that this config was removed in Spark 3.0.0, so it only applies to Spark 2.x clusters.

answered Oct 19 '22 by vaquar khan