External Hive Table Refresh table vs MSCK Repair

I have an external Hive table stored as Parquet, partitioned on a column (say as_of_dt), and data gets inserted via Spark Streaming. Every day a new partition gets added. I am running msck repair table so that the Hive metastore picks up the newly added partition. Is this the only way, or is there a better way? I am also concerned about downstream users querying the table: will msck repair cause any issues such as data being unavailable or stale? I was going through the HiveContext API and saw the refreshTable option. Does it make sense to use refreshTable instead?

955

asked Aug 06 '18 17:08

Ajith Kannan

2 Answers

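To directly answer your question: msck repair table compares the partitions registered in the Hive metastore with the partition directories that exist under the table location. By default it only adds partitions that exist on disk but are missing from the metastore. If you deleted a handful of partitions and don't want them to show up in the show partitions output for the table, you need a Hive version whose msck repair table accepts the drop partitions or sync partitions option (Hive 3.0 and later, as far as I know), or an explicit alter table ... drop partition. Msck repair can take more time than an invalidate or refresh statement because it scans the table's directories. Note that invalidate metadata is an Impala statement that reloads Impala's own metadata cache, and refresh table in Spark SQL only invalidates the metadata Spark has cached for the table. Neither of them registers new partitions in the Hive metastore.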
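The msck repair variants mentioned above can be sketched as a small statement builder. This is a minimal sketch, not a definitive implementation: the object name MsckRepair, its Mode cases, and the table name mydb.events are hypothetical, and the DROP and SYNC options are assumed to be available only on Hive versions that support them.

```scala
// Sketch only: builds MSCK statements as strings; nothing here talks to Hive.
object MsckRepair {
  // Partition handling modes. Drop and Sync are assumed to need a Hive
  // version that supports them (3.0 or later, to my knowledge).
  sealed abstract class Mode(val clause: String)
  case object Default extends Mode("")                  // behaves like Add
  case object Add     extends Mode(" ADD PARTITIONS")   // register new directories
  case object Drop    extends Mode(" DROP PARTITIONS")  // forget deleted directories
  case object Sync    extends Mode(" SYNC PARTITIONS")  // both of the above

  def statement(table: String, mode: Mode = Default): String =
    s"MSCK REPAIR TABLE $table${mode.clause}"
}
```

For example, MsckRepair.statement("mydb.events", MsckRepair.Sync) produces the statement that would also remove partitions whose directories no longer exist.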

The Hive metastore should be fine if you are completing the add partition step somewhere in your processing. However, if you ever want to access the Hive table through Spark SQL, you will also need to refresh the table metadata that Spark caches. Other engines that keep their own metadata cache, such as Impala, need their own refresh as well.

Any time you update or change the contents of a Hive table outside of Spark, the metadata Spark has cached for it can fall out of sync, which can leave you unable to query the current data through spark.sql. If you want to query that data, you need to keep Spark's view of the table in sync.

If you have a Spark version that allows for it, you should add partitions to Hive tables and refresh them from within Spark, so that the Hive metastore and Spark's cached metadata stay in sync. Below is how I do it:

// Non-partitioned table: overwrite the files, then refresh Spark's cached metadata
outputDF.write.format("parquet").mode("overwrite").save(fileLocation)
spark.sql("refresh table " + tableName)

// Partitioned table: write the partition directory, register it in the metastore, then refresh
outputDF.write.format("parquet").mode("overwrite").save(fileLocation + "/" + partition)
val addPartitionsStatement = "alter table " + tableName + " add if not exists partition(partitionKey='" + partition + "') location '" + fileLocation + "/" + partition + "'"
spark.sql(addPartitionsStatement)
spark.sql("refresh table " + tableName)
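Applied to the daily as_of_dt partition from the question, the add partition step could look roughly like the sketch below. The object name DailyPartition, the table name mydb.events, the path /data/events, and the directory layout <tableLocation>/as_of_dt=<date> are all assumptions made for illustration; adjust them to whatever the streaming job actually writes.

```scala
import java.time.LocalDate

// Sketch only: builds the statement as a string; run it with spark.sql(...).
object DailyPartition {
  // LocalDate renders as ISO yyyy-MM-dd when interpolated into a string.
  // The layout <tableLocation>/as_of_dt=<date> is an assumption, not
  // something the original post specifies.
  def addStatement(table: String, tableLocation: String, asOfDt: LocalDate): String = {
    val base = tableLocation.stripSuffix("/") // avoid a double slash in the path
    s"ALTER TABLE $table ADD IF NOT EXISTS PARTITION (as_of_dt='$asOfDt') " +
      s"LOCATION '$base/as_of_dt=$asOfDt'"
  }
}
```

A job could then call spark.sql(DailyPartition.addStatement("mydb.events", "/data/events", LocalDate.now())) once per day. Because of if not exists, running it twice for the same date should be harmless, and it registers a single known partition instead of scanning every directory the way msck repair does.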

152

answered Oct 05 '22 16:10

afeldman

It looks like refreshTable only refreshes the metadata that Spark has cached; it does not affect the Hive metastore.

The doc says:

Invalidate and refresh all the cached the metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache.

The method does not update the Hive metastore, so msck repair (or an explicit add partition) is still necessary.
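Put together, the order of operations would be to repair the metastore first and refresh Spark's cache second. A minimal sketch, assuming a hypothetical PartitionSync helper and a runSql function that stands in for spark.sql:

```scala
// Sketch only: PartitionSync and runSql are illustrative names.
object PartitionSync {
  // Metastore first, then Spark's cached metadata.
  def statements(table: String): Seq[String] =
    Seq(s"MSCK REPAIR TABLE $table", s"REFRESH TABLE $table")

  // runSql is a stand-in for spark.sql; passing it in keeps the sketch
  // independent of a live SparkSession.
  def run(table: String)(runSql: String => Unit): Unit =
    statements(table).foreach(runSql)
}
```

With a SparkSession in scope this would be called as PartitionSync.run("mydb.events")(stmt => spark.sql(stmt)), where mydb.events is again a placeholder table name.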

answered Oct 05 '22 16:10

leftjoin

Tags: apache-spark, hive, hive-partitions, hivecontext
