Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Where does Hive store files in HDFS?

Tags:

hadoop

hive

hdfs

I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly.

Where does Hive store its files in HDFS?

like image 346
Yuval Avatar asked Feb 20 '11 16:02

Yuval


People also ask

Where does hive store data in HDFS?

It queries data stored in a distributed storage solution, like the Hadoop Distributed File System (HDFS) or Amazon S3. Hive stores its database and table metadata in a metastore, which is a database or file backed store that enables easy data abstraction and discovery.

Is hive Metastore stored in HDFS?

The Hive metastore is simply a relational database. It stores metadata related to the tables/schemas you create to easily query big data stored in HDFS. When you create a new Hive table, the information related to the schema (column names, data types) is stored in the Hive metastore relational database.

Where is hive Metastore stored?

By default, the location of warehouse is file:///user/hive/warehouse and we can also use hive-site. xml file for local or remote metastore.

How does hive work with HDFS?

It performs three steps internally: Compiler - The Hive driver passes the query to the compiler, where it is checked and analyzed. Optimizer - Optimized logical plan in the form of a graph of MapReduce and HDFS tasks is obtained. Executor - In the final step, the tasks are executed.


2 Answers

Hive tables may not necessarily be stored in a warehouse (since you can create tables located anywhere on the HDFS).

You should use DESCRIBE FORMATTED <table_name> command.

hive -S -e "describe formatted <table_name> ;" | grep 'Location' | awk '{ print $NF }' 

Please note that partitions may be stored in different places and to get the location of the alpha=foo/beta=bar partition you'd have to add partition(alpha='foo',beta='bar') after <table_name>.

like image 160
Just Avatar answered Sep 22 '22 15:09

Just


The location they are stored on the HDFS is fairly easy to figure out once you know where to look. :)

If you go to http://NAMENODE_MACHINE_NAME:50070/ in your browser it should take you to a page with a Browse the filesystem link.

In the $HIVE_HOME/conf directory there is the hive-default.xml and/or hive-site.xml which has the hive.metastore.warehouse.dir property. That value is where you will want to navigate to after clicking the Browse the filesystem link.

In mine, it's /usr/hive/warehouse. Once I navigate to that location, I see the names of my tables. Clicking on a table name (which is just a folder) will then expose the partitions of the table. In my case, I currently only have it partitioned on date. When I click on the folder at this level, I will then see files (more partitioning will have more levels). These files are where the data is actually stored on the HDFS.

I have not attempted to access these files directly, I'm assuming it can be done. I would take GREAT care if you are thinking about editing them. :) For me - I'd figure out a way to do what I need to without direct access to the Hive data on the disk. If you need access to raw data, you can use a Hive query and output the result to a file. These will have the exact same structure (divider between columns, ect) as the files on the HDFS. I do queries like this all the time and convert them to CSVs.

The section about how to write data from queries to disk is https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries

UPDATE

Since Hadoop 3.0.0 - Alpha 1 there is a change in the default port numbers. NAMENODE_MACHINE_NAME:50070 changes to NAMENODE_MACHINE_NAME:9870. Use the latter if you are running on Hadoop 3.x. The full list of port changes are described in HDFS-9427

like image 42
QuinnG Avatar answered Sep 22 '22 15:09

QuinnG