Consider the following code:
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val path = ...
val dataFrame: DataFrame = ...
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
dataFrame.createOrReplaceTempView("my_table")
val results = hiveContext.sql(s"select * from my_table")
results.write.mode(SaveMode.Append).partitionBy("my_column").format("orc").save(path)
hiveContext.sql("REFRESH TABLE my_table")
This code is executed twice with the same path but different DataFrames. The first run succeeds, but subsequent runs raise an error:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://somepath/somefile.snappy.orc
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I have tried clearing the cache and invoking hiveContext.dropTempTable("tableName"), but nothing has any effect. When should REFRESH TABLE tableName be called (before the write, after it, or some other variant) to fix this error?
The REFRESH TABLE statement invalidates the cached entries, which include the data and metadata of the given table or view. The invalidated cache is repopulated lazily, the next time the cached table or a query associated with it is executed.
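Both the SQL statement and the Catalog API trigger the same invalidation, and the cache is only rebuilt by the next action that touches the table. A minimal sketch of that behaviour, assuming a Spark 2.x session named spark and a table or view named my_table:
// Either call drops the cached metadata and file listing for my_table
spark.sql("REFRESH TABLE my_table")
spark.catalog.refreshTable("my_table")
// The cache is repopulated lazily, i.e. by the next query that reads my_table
val refreshed = spark.sql("select * from my_table")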
You can run spark.catalog.refreshTable(tableName)
or spark.sql(s"REFRESH TABLE $tableName")
just before the write operation. I had the same problem and this fixed it:
// Invalidate the stale cached metadata/file listing before writing again
spark.catalog.refreshTable(tableName)
df.write.mode(SaveMode.Overwrite).insertInto(tableName)
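Applied to the code from the question (which writes with save(path) rather than insertInto), the same idea would look roughly like this; this is a sketch only, assuming my_table is the temp view registered in the question and that the refresh runs on every execution before the write:
dataFrame.createOrReplaceTempView("my_table")
val results = hiveContext.sql("select * from my_table")
// Invalidate any file listing cached by a previous run before touching `path` again
hiveContext.sql("REFRESH TABLE my_table")
results.write.mode(SaveMode.Append).partitionBy("my_column").format("orc").save(path)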