Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference between Hive internal tables and external tables?

People also ask

What is the difference between external table and managed table?

The main difference between a managed and external table is that when you drop an external table, the underlying data files stay intact. This is because the user is expected to manage the data files and directories. With a managed table, the underlying directories and data get wiped out when the table is dropped.

What is an external table in Hive?

An external table is a table for which Hive does not manage storage. If you delete an external table, only the definition in Hive is deleted. The data remains. An internal table is a table that Hive manages. If you delete an internal table, both the definition in Hive and the data are deleted.

What are the different types of tables in Hive?

There are 2 types of tables in Hive, Internal and External.

What is Hive internal table?

Internal tables are also known as Managed tables that are owned and managed by Hive. By default, Hive creates a table as an Internal table and owned the table structure and the files. In other words, Hive completely manages the lifecycle of the table (metadata & data) similar to tables in RDBMS.


Hive has a relational database on the master node it uses to keep track of state. For instance, when you CREATE TABLE FOO(foo string) LOCATION 'hdfs://tmp/';, this table schema is stored in the database.

If you have a partitioned table, the partitions are stored in the database(this allows hive to use lists of partitions without going to the file-system and finding them, etc). These sorts of things are the 'metadata'.

When you drop an internal table, it drops the data, and it also drops the metadata.

When you drop an external table, it only drops the meta data. That means hive is ignorant of that data now. It does not touch the data itself.


Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.

Use EXTERNAL tables when:

  1. The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
  2. Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
  3. You want to use a custom location such as ASV.
  4. Hive should not own data and control settings, dirs, etc., you have another program or process that will do those things.
  5. You are not creating table based on existing table (AS SELECT).

Use INTERNAL tables when:

The data is temporary.

You want Hive to completely manage the lifecycle of the table and data.


To answer you Question :

For External Tables, Hive stores the data in the LOCATION specified during creation of the table(generally not in warehouse directory). If the external table is dropped, then the table metadata is deleted but not the data.

For Internal tables, Hive stores data into its warehouse directory. If the table is dropped then both the table metadata and the data will be deleted.


For your reference,

Difference between Internal & External tables :

For External Tables -

  • External table stores files on the HDFS server but tables are not linked to the source file completely.

  • If you delete an external table the file still remains on the HDFS server.

    As an example if you create an external table called “table_test” in HIVE using HIVE-QL and link the table to file “file”, then deleting “table_test” from HIVE will not delete “file” from HDFS.

  • External table files are accessible to anyone who has access to HDFS file structure and therefore security needs to be managed at the HDFS file/folder level.

  • Meta data is maintained on master node, and deleting an external table from HIVE only deletes the metadata not the data/file.


For Internal Tables-

  • Stored in a directory based on settings in hive.metastore.warehouse.dir, by default internal tables are stored in the following directory “/user/hive/warehouse” you can change it by updating the location in the config file .
  • Deleting the table deletes the metadata and data from master-node and HDFS respectively.
  • Internal table file security is controlled solely via HIVE. Security needs to be managed within HIVE, probably at the schema level (depends on organization).

Hive may have internal or external tables, this is a choice that affects how data is loaded, controlled, and managed.

Use EXTERNAL tables when:

  • The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
  • Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schema (tables or views) at a single data set or if you are iterating through various possible schema.
  • Hive should not own data and control settings, directories, etc., you may have another program or process that will do those things.
  • You are not creating table based on existing table (AS SELECT).

Use INTERNAL tables when:

  • The data is temporary.
  • You want Hive to completely manage the life-cycle of the table and data.

Source :

HDInsight: Hive Internal and External Tables Intro

Internal & external tables in Hadoop- HIVE


An internal table data is stored in the warehouse folder, whereas an external table data is stored at the location you mentioned in table creation.

So when you delete an internal table, it deletes the schema as well as the data under the warehouse folder, but for an external table it's only the schema that you will loose.

So when you want an external table back you again after deleting it, can create a table with the same schema again and point it to the original data location. Hope it is clear now.


The only difference in behaviour (not the intended usage) based on my limited research and testing so far (using Hive 1.1.0 -cdh5.12.0) seems to be that when a table is dropped

  • the data of the Internal (Managed) tables gets deleted from the HDFS file system
  • while the data of the External tables does NOT get deleted from the HDFS file system.

(NOTE: See Section 'Managed and External Tables' in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL which list some other difference which I did not completely understand)

I believe Hive chooses the location where it needs to create the table based on the following precedence from top to bottom

  1. Location defined during the Table Creation
  2. Location defined in the Database/Schema Creation in which the table is created.
  3. Default Hive Warehouse Directory (Property hive.metastore.warehouse.dir in hive.site.xml)

When the "Location" option is not used during the "creation of a hive table", the above precedence rule is used. This is applicable for both Internal and External tables. This means an Internal table does not necessarily have to reside in the Warehouse directory and can reside anywhere else.

Note: I might have missed some scenarios, but based on my limited exploration, the behaviour of both Internal and Extenal table seems to be the same except for the one difference (data deletion) described above. I tried the following scenarios for both Internal and External tables.

  1. Creating table with and without Location option
  2. Creating table with and without Partition Option
  3. Adding new data using the Hive Load and Insert Statements
  4. Adding data files to the Table location outside of Hive (using HDFS commands) and refreshing the table using the "MSCK REPAIR TABLE command
  5. Dropping the tables