Spark HiveContext does not retrieve newly inserted records from Hive Table

I am using Spark 1.4 and connecting to Hive through a HiveContext. I did the following:

import org.apache.spark.sql.hive.HiveContext
val hx = new HiveContext(sc)
import hx.implicits._
hx.sql("select * from tab").show

// it is fine, result was shown as expected

Then I inserted a few records into tab from the beeline console and ran:

hx.refreshTable("tab")
hx.sql("select * from tab").show

// still old records, no newly inserted records

My question is: why didn't the HiveContext retrieve the newly inserted records?

david2028 asked Jul 21 '15 15:07


1 Answer

hiveContext.refreshTable(tableName: String) refreshes only the metadata of the table, not the actual data.

Notes from the official documentation (credit: https://spark.apache.org):

refreshTable(tableName: String): Unit

Invalidate and refresh all the cached metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache.

To retrieve the newly inserted records, uncache the table first and then cache it again, using uncacheTable(String tableName) and cacheTable(String tableName).
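
The uncache-then-cache step could be sketched roughly as follows. This is a minimal sketch, not a confirmed fix for every setup: reloadTable is a hypothetical helper name (it is not part of Spark), and it assumes the table was registered in the same HiveContext used for the query. The isCached guard is there because uncaching a table that was never cached may fail on some Spark versions.

```scala
import org.apache.spark.sql.hive.HiveContext

// Hypothetical helper (not a Spark API): drop any cached copy of `table`,
// refresh its metadata, then mark it for caching again.
def reloadTable(hx: HiveContext, table: String): Unit = {
  // Only uncache when the table is actually cached in this context.
  if (hx.isCached(table)) hx.uncacheTable(table)
  // Invalidate cached metadata (for example, file or block locations).
  hx.refreshTable(table)
  // Caching is lazy: the data is re-read on the next action against the table.
  hx.cacheTable(table)
}

// Usage in spark-shell, where `sc` is the SparkContext the shell provides:
// val hx = new HiveContext(sc)
// reloadTable(hx, "tab")
// hx.sql("select * from tab").show()
```

If caching is not wanted at all, the cacheTable call can be dropped; the helper then only clears the stale cached copy and refreshes the metadata.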

vijay kumar answered Jan 04 '23 00:01