Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive Managed Table vs External Table : LOCATION directory

Tags:

hadoop

hive

I have been going through some HIVE books and tutorials. One of the book - Hadoop in Practice says

When you create an external (unmanaged) table, Hive keeps the data in the directory specified by the LOCATION keyword intact. But if you were to execute the same CREATE command and drop the EXTERNAL keyword, the table would be a managed table, and Hive would move the contents of the LOCATION directory into /user/hive/ warehouse/stocks, which may not be the behavior you expect.

I created a managed table with LOCATION keyword. And then loaded data into the table from a HDFS file. But I could not see any directory created under /user/hive/warehouse. Rather the new directory was created in LOCATION mentioned. So I think if I create a MANAGED table with LOCATION mentioned then there is nothing created in Hive warehouse directory ? Is this understanding correct ?

Also if the location of the input file during LOAD command is hdfs, then internal or external table both will move the data to their location. Is this understanding also correct ?

like image 861
user1060418 Avatar asked Jul 09 '15 07:07

user1060418


People also ask

Where are Hive managed tables stored?

A managed table is stored under the hive. metastore. warehouse. dir path property, by default in a folder path similar to /user/hive/warehouse/databasename.

Is it possible to change the default location of managed tables in Hive if so how?

Absolutely, by using the LOCATION keyword, we can change the default location of Managed tables while creating the managed table in Hive.

Which of these is the difference between a managed and an external table?

The main difference between a managed and external table is that when you drop an external table, the underlying data files stay intact. This is because the user is expected to manage the data files and directories. With a managed table, the underlying directories and data get wiped out when the table is dropped.

How do you know if a table is managed table or external table in Hive?

We can identify the internal or External tables using the DESCRIBE FORMATTED table_name statement in the Hive, which will display either MANAGED_TABLE or EXTERNAL_TABLE depending on the table type.


2 Answers

In both cases(managed or external) Location is optional so whenever you specify LOCATION data will be stored on the same HDFC LOCATION PATH irrespective of which table you are creating(managed or external). And, if you don't use LOCATION, default location path which is mentioned in hive-site.xml is considered.

like image 189
Ravindra Phule Avatar answered Nov 15 '22 07:11

Ravindra Phule


First of all when you create a managed table with location keyword, it does not create a directory at the specified location, rather it will give you an Exception:FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:hdfs://path/of/the/given/location is not a directory or unable to create one).
This means that in the DDL, the location that you have given, first needs the directory to be present, else the above given Exception will be thrown.
Next you can create the DDL with location given.
Next you can use a select * from <table> command to view the data (without having to load data).
But when you drop this table, your data is also gone from hdfs (unlike External tables) and also gone is the metadata.
This is the primary difference between a managed table with location keyword. It behaves partly like external table, and partly like managed table.
External, as in, you dont have to load the data, and you just specify the location.
Managed, as in, you drop the table, the data is also deleted.
Hope that makes sense.

like image 39
aiman Avatar answered Nov 15 '22 07:11

aiman