I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly.
Where does Hive store its files in HDFS?
Hive queries data stored in a distributed storage system, such as the Hadoop Distributed File System (HDFS) or Amazon S3. It stores its database and table metadata in a metastore, which is a database-backed or file-backed store that enables easy data abstraction and discovery.
The Hive metastore is simply a relational database. It stores metadata related to the tables/schemas you create to easily query big data stored in HDFS. When you create a new Hive table, the information related to the schema (column names, data types) is stored in the Hive metastore relational database.
By default, the warehouse location is /user/hive/warehouse (the value of the hive.metastore.warehouse.dir property). It can be changed in the hive-site.xml file, whether the metastore is local or remote.
Hive processes a query in three internal steps. Compiler - the Hive driver passes the query to the compiler, where it is checked and analyzed. Optimizer - an optimized logical plan is produced in the form of a graph of MapReduce and HDFS tasks. Executor - in the final step, the tasks are executed.
Hive tables may not necessarily be stored in a warehouse (since you can create tables located anywhere on the HDFS).
You should use the DESCRIBE FORMATTED <table_name> command.
hive -S -e "describe formatted <table_name> ;" | grep 'Location' | awk '{ print $NF }'
Please note that partitions may be stored in different places. To get the location of the alpha=foo/beta=bar partition, you'd have to add partition(alpha='foo',beta='bar') after <table_name>.
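As a rough sketch of the commands above, the two helpers below build the DESCRIBE FORMATTED statement (optionally with a partition spec) and pull the path out of its output. The function names describe_stmt and extract_location, and the table name my_table in the usage comment, are made up for illustration. The parsing assumes the location path contains no whitespace.

```shell
# Build a DESCRIBE FORMATTED statement.
# $1 = table name, $2 = optional partition spec, e.g. "alpha='foo',beta='bar'"
describe_stmt() {
  if [ -n "${2:-}" ]; then
    printf 'describe formatted %s partition(%s);' "$1" "$2"
  else
    printf 'describe formatted %s;' "$1"
  fi
}

# Read DESCRIBE FORMATTED output on stdin and print the value of the
# first row whose first field is exactly "Location:".
# Assumption: the path itself contains no whitespace.
extract_location() {
  awk '$1 == "Location:" { print $NF; exit }'
}

# Usage (requires a working hive CLI; my_table is a placeholder):
#   hive -S -e "$(describe_stmt my_table "alpha='foo',beta='bar'")" | extract_location
```

Matching on the first field rather than grepping for 'Location' anywhere avoids picking up other rows that merely mention the word.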
The location they are stored on the HDFS is fairly easy to figure out once you know where to look. :)
If you go to http://NAMENODE_MACHINE_NAME:50070/ in your browser, it should take you to a page with a Browse the filesystem link.
In the $HIVE_HOME/conf directory there is the hive-default.xml and/or hive-site.xml file, which has the hive.metastore.warehouse.dir property. That value is where you will want to navigate to after clicking the Browse the filesystem link.
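If you'd rather not open the XML files, the Hive CLI can print the effective value itself: running set hive.metastore.warehouse.dir; outputs a key=value line. Here is a small sketch of parsing that line. The helper name warehouse_dir_from_set is hypothetical.

```shell
# Read the output of `set hive.metastore.warehouse.dir;` on stdin
# (a line of the form key=value) and print only the value.
warehouse_dir_from_set() {
  sed -n 's/^hive\.metastore\.warehouse\.dir=//p'
}

# Usage (requires a working hive CLI):
#   hive -S -e "set hive.metastore.warehouse.dir;" | warehouse_dir_from_set
```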
In mine, it's /usr/hive/warehouse. Once I navigate to that location, I see the names of my tables. Clicking on a table name (which is just a folder) will then expose the partitions of the table. In my case, I currently only have it partitioned on date. When I click on the folder at this level, I will then see files (more partitioning will have more levels). These files are where the data is actually stored on the HDFS.
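The same folders can be listed from the command line with hdfs dfs -ls instead of the web UI. The sketch below only assembles the path, following the layout described here (warehouse directory, then table, then one key=value level per partition column). The name partition_path and the example table and partition values are hypothetical, and tables created with an explicit location will not follow this layout.

```shell
# Join a warehouse dir, a table name and zero or more partition
# levels (key=value) into one path.
# $1 = warehouse dir, $2 = table, remaining args = partition levels in order.
partition_path() {
  path="${1%/}/$2"        # drop a trailing slash from the warehouse dir
  shift 2
  for level in "$@"; do
    path="$path/$level"
  done
  printf '%s\n' "$path"
}

# Usage (requires the hdfs CLI; names are placeholders):
#   hdfs dfs -ls "$(partition_path /usr/hive/warehouse my_table date=2020-01-01)"
```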
I have not attempted to access these files directly, but I'm assuming it can be done. I would take GREAT care if you are thinking about editing them. :) For me, I'd figure out a way to do what I need to without direct access to the Hive data on the disk. If you need access to raw data, you can use a Hive query and output the result to a file. The output files will have the exact same structure (divider between columns, etc.) as the files on the HDFS. I do queries like this all the time and convert them to CSVs.
The section about how to write data from queries to disk is https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
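For the CSV conversion, one hedged sketch: when no row format is specified, Hive writes text output with Ctrl-A (\001) between columns, so a naive conversion is a single tr call. The function name ctrl_a_to_csv and the paths and table name in the usage comment are placeholders. This does no quoting, so it is only safe when field values contain no commas, quotes, or newlines.

```shell
# Replace Hive's default Ctrl-A (\001) field delimiter with commas.
# Naive: does not quote or escape field values.
ctrl_a_to_csv() {
  tr '\001' ','
}

# Usage (requires a working hive CLI; paths and table are placeholders):
#   hive -e "INSERT OVERWRITE LOCAL DIRECTORY '/tmp/my_table_out' SELECT * FROM my_table;"
#   cat /tmp/my_table_out/* | ctrl_a_to_csv > my_table.csv
```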
UPDATE
Since Hadoop 3.0.0-alpha1 there is a change in the default port numbers. NAMENODE_MACHINE_NAME:50070 changes to NAMENODE_MACHINE_NAME:9870. Use the latter if you are running on Hadoop 3.x. The full list of port changes is described in HDFS-9427.
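If you script against the web UI, a tiny helper can pick the default port from the Hadoop version. This is only a sketch of the rule stated here: the function name namenode_ui_port is made up, it treats anything that is not 1.x or 2.x as 3.x or later, and a cluster may override the port in its own configuration.

```shell
# Print the default NameNode web UI port for a Hadoop version string.
# Assumption: any version other than 1.x or 2.x uses the new default.
namenode_ui_port() {
  case "$1" in
    1.*|2.*) echo 50070 ;;
    *)       echo 9870 ;;
  esac
}

# Usage (version string is a placeholder):
#   echo "http://NAMENODE_MACHINE_NAME:$(namenode_ui_port 3.3.6)/"
```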