As discussed in the Impala tutorials, Impala uses the metastore shared with Hive. But they also mention that if you create or modify tables through Hive, you should execute the INVALIDATE METADATA or REFRESH command to inform Impala about the changes.
So I am confused, and my question is: if the metadata database is shared, why does INVALIDATE METADATA or REFRESH need to be executed in Impala at all?
And if the reason is that Impala caches metadata, why don't the daemons update their caches themselves on a cache miss, without the metadata having to be refreshed manually?
Any help is appreciated.
OK! Let's start with the question from your comment: what is the benefit of a centralized metastore?
With a central metastore, the user does not have to maintain metadata in two different places, one for Hive and one for Impala. There is a single repository, and both tools read their metadata from it.
Now the second part: why is INVALIDATE METADATA or REFRESH needed when the metastore is shared?
Impala uses a massively parallel processing (MPP) design. Instead of reading from the centralized metastore for every query, it keeps the metadata cached on the executor nodes, so that it can avoid cold starts in which a significant amount of time may be spent reading metadata.
INVALIDATE METADATA and REFRESH propagate the current metadata and block information to the executor nodes. A cached entry that has gone stale still counts as a cache hit, so a node has no reason to reread the metastore until one of these commands tells it to.
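To make the caching point concrete, here is a minimal toy sketch in Python. It is not how Impala is implemented, and the Metastore and Node classes, the refresh method, and the table name are all hypothetical, invented for illustration. It only shows why a node that already holds a cached copy does not notice a change made elsewhere until it is told to reload.

```python
# Toy model only: this is not Impala's real code, and every name is hypothetical.

class Metastore:
    """Stands in for the shared metastore database."""
    def __init__(self):
        self.tables = {}  # table name -> list of column names


class Node:
    """Stands in for one executor node holding a local metadata cache."""
    def __init__(self, metastore):
        self.metastore = metastore
        self.cache = {}

    def columns(self, table):
        # Only a cache miss triggers a read from the metastore.
        if table not in self.cache:
            self.cache[table] = list(self.metastore.tables[table])
        return self.cache[table]

    def refresh(self, table):
        # Loosely models REFRESH: reload one table's metadata from the metastore.
        self.cache[table] = list(self.metastore.tables[table])


metastore = Metastore()
metastore.tables["orders"] = ["id", "amount"]
node = Node(metastore)

node.columns("orders")                        # miss: loaded from the metastore
metastore.tables["orders"].append("region")   # change made outside the node, e.g. through Hive
stale_read = list(node.columns("orders"))     # hit: the old cached copy is returned
node.refresh("orders")
fresh_read = list(node.columns("orders"))     # now reflects the new column
```

In this sketch the second read is a hit, not a miss, which is why the node never goes back to the metastore on its own.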
Why do it manually?
In earlier versions of Impala, the catalogd process did not exist, and metadata updates had to be propagated with the commands mentioned above. Starting with Impala 1.2, catalogd was added, and this process relays metadata changes made through Impala SQL statements to all the nodes in a cluster.
That removes the need to do it manually for changes made through Impala. Changes made through Hive still require INVALIDATE METADATA or REFRESH.
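The relay idea can be sketched the same way. Again, this is a toy Python model under stated assumptions, not catalogd's actual implementation: the Catalog and Node classes and the table names are hypothetical. It shows a change that goes through the catalog reaching every node, while a change written straight to the metastore reaches none of them.

```python
# Toy model only: hypothetical names, not Impala's actual catalogd implementation.

class Node:
    """Stands in for one executor node with a local metadata cache."""
    def __init__(self):
        self.cache = {}  # table name -> list of column names


class Catalog:
    """Stands in for a catalog service that relays changes to every node."""
    def __init__(self, metastore, nodes):
        self.metastore = metastore
        self.nodes = nodes

    def create_table(self, name, columns):
        # A change made through the catalog is written to the metastore
        # and pushed to all nodes in the same step.
        self.metastore[name] = list(columns)
        for node in self.nodes:
            node.cache[name] = list(columns)


metastore = {}
nodes = [Node(), Node(), Node()]
catalog = Catalog(metastore, nodes)

catalog.create_table("orders", ["id", "amount"])  # goes through the catalog
metastore["returns"] = ["id", "reason"]           # written directly, bypassing the catalog

seen_orders = ["orders" in n.cache for n in nodes]
seen_returns = ["returns" in n.cache for n in nodes]
```

Here every node knows about "orders" and none knows about "returns", which mirrors why a table created outside Impala still needs a manual command.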
Hope that helps.