I have been going through some HIVE books and tutorials. One of the book - Hadoop in Practice says
When you create an external (unmanaged) table, Hive keeps the data in the directory specified by the LOCATION keyword intact. But if you were to execute the same CREATE command and drop the EXTERNAL keyword, the table would be a managed table, and Hive would move the contents of the LOCATION directory into /user/hive/ warehouse/stocks, which may not be the behavior you expect.
I created a managed table with LOCATION keyword. And then loaded data into the table from a HDFS file. But I could not see any directory created under /user/hive/warehouse. Rather the new directory was created in LOCATION mentioned. So I think if I create a MANAGED table with LOCATION mentioned then there is nothing created in Hive warehouse directory ? Is this understanding correct ?
Also if the location of the input file during LOAD command is hdfs, then internal or external table both will move the data to their location. Is this understanding also correct ?
A managed table is stored under the hive. metastore. warehouse. dir path property, by default in a folder path similar to /user/hive/warehouse/databasename.
Absolutely, by using the LOCATION keyword, we can change the default location of Managed tables while creating the managed table in Hive.
The main difference between a managed and external table is that when you drop an external table, the underlying data files stay intact. This is because the user is expected to manage the data files and directories. With a managed table, the underlying directories and data get wiped out when the table is dropped.
We can identify the internal or External tables using the DESCRIBE FORMATTED table_name statement in the Hive, which will display either MANAGED_TABLE or EXTERNAL_TABLE depending on the table type.
In both cases(managed or external) Location is optional so whenever you specify LOCATION data will be stored on the same HDFC LOCATION PATH irrespective of which table you are creating(managed or external). And, if you don't use LOCATION, default location path which is mentioned in hive-site.xml is considered.
First of all when you create a managed table
with location
keyword, it does not create a directory at the specified location
, rather it will give you an Exception:FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:hdfs://path/of/the/given/location is not a directory or unable to create one)
.
This means that in the DDL, the location
that you have given, first needs the directory to be present, else the above given Exception will be thrown.
Next you can create the DDL with location
given.
Next you can use a select * from <table>
command to view the data (without having to load data).
But when you drop this table, your data is also gone from hdfs (unlike External tables) and also gone is the metadata.
This is the primary difference between a managed table with location
keyword. It behaves partly like external table, and partly like managed table.
External, as in, you dont have to load the data, and you just specify the location.
Managed, as in, you drop the table, the data is also deleted.
Hope that makes sense.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With