Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference between invalidate metadata and refresh commands in Impala?

I saw at this link which affects Impala version 1.1:

Since Impala 1.1, REFRESH statement only works for existing tables. For new tables you need to issue "INVALIDATE METADATA" statement.

Does this still hold true for later versions of Impala?

like image 329
covfefe Avatar asked Feb 15 '17 01:02

covfefe


People also ask

Why we use invalidate metadata in Impala?

INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive and other Hive client, such as SparkSQL: Metadata of existing tables changes. New tables are added, and Impala will use the tables. The SERVER or DATABASE level Ranger privileges are changed.

What does refresh in Impala do?

The REFRESH statement reloads the metadata for the table from the metastore database and does an incremental reload of the file and block metadata from the HDFS NameNode. REFRESH is used to avoid inconsistencies between Impala and external metadata sources, namely Hive Metastore (HMS) and NameNodes.

What is used to refresh the Impala metadata?

To flush the metadata for all tables, use the INVALIDATE METADATA command. Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table before you can see the new table in impala-shell.

How do I run invalidate metadata in Impala?

To flush the metadata for all tables, use the INVALIDATE METADATA command. Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table before you can see the new table in impala-shell.


2 Answers

According to Cloudera's Impala guide (Cloudera Enterprise 5.8) but stayed the same for 5.9:

INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA waits to reload the metadata when needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall. If data was altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE METADATA statement works just like the Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is optimized for the common use case of adding new data files to an existing table, thus the table name argument is now required.

and related to working on existing tables:

The table name is a required parameter [for REFRESH]. To flush the metadata for all tables, use the INVALIDATE METADATA command. Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table before you can see the new table in impala-shell. Once the table is known by Impala, you can issue REFRESH table_name after you add data files for that table.

So it seems like it indeed stayed the same. I believe CDH 5.9 comes with Impala 2.7.

like image 64
spijs Avatar answered Oct 20 '22 11:10

spijs


As per Impala document Invalidate Metada and Refresh

INVALIDATE METADATA Statement

The INVALIDATE METADATA statement marks the metadata for one or all tables as stale. The next time the Impala service performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. As this is a very expensive operation compared to the incremental metadata update done by the REFRESH statement, when possible, prefer REFRESH rather than INVALIDATE METADATA.

INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive and other Hive client, such as SparkSQL:

  • Metadata of existing tables changes.
  • New tables are added, and Impala will use the tables.
  • The SERVER or DATABASE level Sentry privileges are changed.
  • Block metadata changes, but the files remain the same (HDFS rebalance). UDF jars change.
  • Some tables are no longer queried, and you want to remove their metadata from the catalog and coordinator caches to reduce memory requirements.

No INVALIDATE METADATA is needed when the changes are made by impalad.

REFRESH Statement

The REFRESH statement reloads the metadata for the table from the metastore database and does an incremental reload of the file and block metadata from the HDFS NameNode. REFRESH is used to avoid inconsistencies between Impala and external metadata sources, namely Hive Metastore (HMS) and NameNodes.

Usage notes:

The table name is a required parameter, and the table must already exist and be known to Impala.

Only the metadata for the specified table is reloaded.

Use the REFRESH statement to load the latest metastore metadata for a particular table after one of the following scenarios happens outside of Impala:

  • Deleting, adding, or modifying files.
    For example, after loading new data files into the HDFS data directory for the table, appending to an existing HDFS file, inserting data from Hive via INSERT or LOAD DATA.
  • Deleting, adding, or modifying partitions.
    For example, after issuing ALTER TABLE or other table-modifying SQL statement in Hive
like image 35
KZapagol Avatar answered Oct 20 '22 12:10

KZapagol