I am using the code below to create a table from a DataFrame in Databricks and run into an error.
df.write.saveAsTable("newtable")
This works fine the very first time, but for re-usability, if I rewrite it as below
df.write.mode(SaveMode.Overwrite).saveAsTable("newtable")
I get the following error.
Error Message:
org.apache.spark.sql.AnalysisException: Can not create the managed table newtable. The associated location dbfs:/user/hive/warehouse/newtable already exists
Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.
The INSERT OVERWRITE statement overwrites the existing data in the table with new values. The inserted rows can be specified by value expressions or come from a query.
createOrReplaceTempView is used when you want to keep the table for a specific Spark session. createOrReplaceTempView creates (or replaces, if a view with that name already exists) a lazily evaluated "view" that you can then use like a Hive table in Spark SQL.
The overwrite mode first drops the table if it already exists in the database by default. Use this option with due care to avoid unexpected data loss. When writing over JDBC in overwrite mode, if you do not set the truncate option, the table is dropped and recreated, so indexes are lost; for example, a columnstore table would become a heap.
You can also try setting it at the cluster level in the Spark configuration. Another option is to manually clean up the data directory specified in the error message, which you can do with dbutils.fs.rm. Please refer to the documentation that addresses this issue: "Create table in overwrite mode fails when interrupted".
Supported features:
- Apache Spark: 2.4.x, 3.0.x
- Scala: 2.11, 2.12
- Microsoft JDBC Driver for SQL Server: 8.4
- Microsoft SQL Server: SQL Server 2008 or later
Other bulk copy options can be set as options on the DataFrame and will be passed to the bulk copy APIs on write. The Apache Spark Connector for SQL Server and Azure SQL is up to 15x faster than the generic JDBC connector for writing to SQL Server. Performance characteristics vary with the type and volume of data and the options used, and may show run-to-run variation.
The SQL config 'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation' was removed in version 3.0.0. It was removed to prevent loss of user data when set to a non-default value.
Run the following command to fix the issue:
dbutils.fs.rm("dbfs:/user/hive/warehouse/newtable/", true)
Or set the flag spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation to true:
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")