I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly. Where does Hive store its files in HDFS?


Where does Hive store files in HDFS?

2 Answers

Hive tables are not necessarily stored in the warehouse directory, since you can create tables located anywhere on HDFS.

To find where a given table lives, use the DESCRIBE FORMATTED <table_name> command:

hive -S -e "describe formatted <table_name> ;" | grep 'Location' | awk '{ print $NF }'

Note that partitions may be stored in different places. To get the location of the alpha=foo/beta=bar partition, add partition(alpha='foo',beta='bar') after <table_name>.
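As a rough sketch, the one-liner above can be wrapped in a small shell function that also covers the partition case. The function names extract_location and table_location are made up for illustration, and the matching assumes the Location: label is the first field on its line in the DESCRIBE FORMATTED output:

```shell
# Read DESCRIBE FORMATTED output on stdin and print the value of the
# "Location:" row (the last whitespace-separated field on that line).
extract_location() {
  awk '$1 == "Location:" { print $NF }'
}

# Hypothetical wrapper: $1 is the table name, $2 is an optional partition
# spec such as alpha='foo',beta='bar'.
table_location() {
  table="$1"
  partition_spec="$2"
  if [ -n "$partition_spec" ]; then
    query="DESCRIBE FORMATTED ${table} PARTITION(${partition_spec});"
  else
    query="DESCRIBE FORMATTED ${table};"
  fi
  hive -S -e "$query" | extract_location
}
```

For example, table_location my_table "alpha='foo',beta='bar'" would ask Hive for that partition's location, while table_location my_table returns the table's own location.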


answered Sep 22 '22 15:09

Just

Where the tables are stored on HDFS is fairly easy to figure out once you know where to look. :)

If you go to http://NAMENODE_MACHINE_NAME:50070/ in your browser it should take you to a page with a Browse the filesystem link.

The $HIVE_HOME/conf directory contains hive-default.xml and/or hive-site.xml, which set the hive.metastore.warehouse.dir property. That value is the path to navigate to after clicking the Browse the filesystem link.
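If you would rather not open the XML files, one possible shortcut is to ask the Hive CLI for the property. This is only a sketch: the function name warehouse_dir is hypothetical, and it relies on the CLI echoing the property as a key=value line when you run set with just the property name:

```shell
# Print the configured warehouse directory.
# "set <property>;" in the Hive CLI prints a line of the form <property>=<value>;
# sed strips the key and prints only the value.
warehouse_dir() {
  hive -S -e "set hive.metastore.warehouse.dir;" |
    sed -n 's/^hive\.metastore\.warehouse\.dir=//p'
}
```

The result can then be passed to the HDFS shell, for example hdfs dfs -ls "$(warehouse_dir)", to see the same table folders the web UI shows.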

In mine, it's /usr/hive/warehouse. Once I navigate to that location, I see the names of my tables. Clicking on a table name (which is just a folder) will then expose the partitions of the table. In my case, I currently only have it partitioned on date. When I click on the folder at this level, I will then see files (more partitioning will have more levels). These files are where the data is actually stored on the HDFS.
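The same walk through table folder, partition folders, and data files can be approximated from the command line with a recursive listing. The sketch below uses made-up function names (data_files, table_files), and the example path is just the warehouse value mentioned above:

```shell
# Keep only regular files from "hdfs dfs -ls -R" output on stdin and print
# their paths. File lines start with "-", directory lines start with "d".
# Caveat: paths containing spaces are not handled by this simple sketch.
data_files() {
  awk '/^-/ { print $NF }'
}

# List every data file under a table directory (needs a reachable cluster).
table_files() {
  hdfs dfs -ls -R "$1" | data_files
}
```

For instance, table_files /usr/hive/warehouse/my_table would print one path per data file, with the partition folders (such as date=...) visible in each path.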

I have not attempted to access these files directly, but I assume it can be done. I would take GREAT care if you are thinking about editing them. :) Personally, I'd figure out a way to do what I need without direct access to the Hive data on disk. If you need access to raw data, you can use a Hive query and output the result to a file. The output will have the exact same structure (divider between columns, etc.) as the files on HDFS. I do queries like this all the time and convert them to CSVs.

The section about how to write data from queries to disk is https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
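As an illustration of the query-then-convert approach, here is a minimal sketch that turns the Hive CLI's output into CSV. It assumes the CLI prints tab-separated columns, and the function name tsv_to_csv is invented for this example; fields that contain a comma or a double quote are wrapped in quotes:

```shell
# Convert tab-separated lines on stdin to CSV on stdout.
tsv_to_csv() {
  awk 'BEGIN { FS = "\t"; OFS = "," }
  {
    for (i = 1; i <= NF; i++) {
      if ($i ~ /[",]/) {
        gsub(/"/, "\"\"", $i)      # double any embedded quotes
        $i = "\"" $i "\""          # wrap the field in quotes
      }
    }
    $1 = $1                        # force the record to be rebuilt with OFS
    print
  }'
}
```

A possible use would be hive -S -e "SELECT * FROM my_table" | tsv_to_csv > my_table.csv, where my_table is a placeholder table name.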

UPDATE

Since Hadoop 3.0.0-alpha1, the default port numbers have changed: NAMENODE_MACHINE_NAME:50070 becomes NAMENODE_MACHINE_NAME:9870. Use the latter if you are running Hadoop 3.x. The full list of port changes is described in HDFS-9427.
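If a script has to work against both old and new clusters, the port choice could be sketched like this. The function name namenode_ui_port is hypothetical, and it only knows the two default ports mentioned here; a cluster with a custom HTTP address will differ:

```shell
# Print the default NameNode web UI port for a given Hadoop version string.
# Only the major version (the part before the first dot) is inspected.
namenode_ui_port() {
  major="${1%%.*}"
  if [ "$major" -ge 3 ]; then
    echo 9870
  else
    echo 50070
  fi
}
```

The version string can be taken from the first line of hadoop version, for example namenode_ui_port "$(hadoop version | awk 'NR == 1 { print $2 }')".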

answered Sep 22 '22 15:09

QuinnG


Tags: hadoop, hive, hdfs

Yuval
