I am reading a Hive table using Spark SQL and assigning it to a scala val
val x = sqlContext.sql("select * from some_table")
Then I am doing some processing with the dataframe x and finally coming up with a dataframe y , which has the exact schema as the table some_table.
Finally I am trying to insert overwrite the y dataframe to the same hive table some_table
y.write.mode(SaveMode.Overwrite).saveAsTable().insertInto("some_table")
Then I am getting the error
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from
I tried creating an insert sql statement and firing it using sqlContext.sql() but it too gave me the same error.
Is there any way I can bypass this error? I need to insert the records back to the same table.
Hi I tried doing as suggested , but still getting the same error .
val x = sqlContext.sql("select * from incremental.test2")
val y = x.limit(5)
y.registerTempTable("temp_table")
val dy = sqlContext.table("temp_table")
dy.write.mode("overwrite").insertInto("incremental.test2")
scala> dy.write.mode("overwrite").insertInto("incremental.test2")
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.;
Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them automatically.
The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame .
Actually you can also use checkpointing to achieve this. Since it breaks data lineage, Spark is not able to detect that you are reading and overwriting in the same table:
sqlContext.sparkContext.setCheckpointDir(checkpointDir)
val ds = sqlContext.sql("select * from some_table").checkpoint()
ds.write.mode("overwrite").saveAsTable("some_table")
You should first save your DataFrame y
in a temporary table
y.write.mode("overwrite").saveAsTable("temp_table")
Then you can overwrite rows in your target table
val dy = sqlContext.table("temp_table")
dy.write.mode("overwrite").insertInto("some_table")
You should first save your DataFrame y
like a parquet file:
y.write.parquet("temp_table")
After you load this like:
val parquetFile = sqlContext.read.parquet("temp_table")
And finish you insert your data in your table
parquetFile.write.insertInto("some_table")
In context to Spark 2.2
'spark.sql.partitionProvider' 'spark.sql.sources.provider' 'spark.sql.sources.schema.numPartCols 'spark.sql.sources.schema.numParts' 'spark.sql.sources.schema.part.0' 'spark.sql.sources.schema.part.1' 'spark.sql.sources.schema.part.2' 'spark.sql.sources.schema.partCol.0' 'spark.sql.sources.schema.partCol.1'
https://querydb.blogspot.com/2019/07/read-from-hive-table-and-write-back-to.html
when we upgraded our HDP to 2.6.3 The Spark was updated from 2.2 to 2.3 which resulted in below error -
Caused by: org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is also being read from.;
at org.apache.spark.sql.execution.command.DDLUtils$.verifyNotReadPath(ddl.scala:906)
This error occurs for job where-in we are reading and writing to same path. Like Jobs with SCD Logic
Solution -
https://querydb.blogspot.com/2020/09/orgapachesparksqlanalysisexception.html
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With