 

How does Hive store data loaded from HDFS?

I am fairly new to Hadoop (HDFS and HBase) and the Hadoop ecosystem (Hive, Pig, Impala, etc.). I have a good understanding of Hadoop components such as the NameNode, DataNode, JobTracker, and TaskTracker, and of how they work in tandem to store data efficiently.

While trying to understand the fundamentals of a data access layer such as Hive, I need to understand where exactly the data of a table created in Hive gets stored. We can create external and internal tables in Hive. As external tables can be in HDFS or any other file system, Hive doesn't store data for such tables in the warehouse. What about internal tables? Such a table will be created as a directory on one of the data nodes of the Hadoop cluster. Once we load data into these tables from the local or HDFS file system, are further files created to store the data in the Hive tables?

For example:

  1. A sample file named test_emp_feedback.csv was brought from the local file system to HDFS.
  2. A table (emp_feedback) was created in Hive with a structure matching that of the CSV file. This led to the creation of a directory in the Hadoop cluster, say /users/big_data/hive/emp_feedback.
  3. I then load data into the emp_feedback table from test_emp_feedback.csv.

Is Hive going to create a copy of the file in the emp_feedback directory? Won't that cause data redundancy?

funsuk asked Sep 15 '25 20:09



1 Answer

Creating a managed table creates a directory with the same name as the table under the Hive warehouse directory (usually /user/hive/warehouse/dbname.db/tablename, or /user/hive/warehouse/tablename for the default database). The table structure (Hive metadata) is also recorded in the metastore (RDBMS/HCatalog).

Before you load data into the table, this directory (named after the table, under the Hive warehouse) is empty.

There are two possible scenarios:

  1. If the table is external, the data is not copied to the warehouse directory at all.

  2. If the table is managed (not external), then when you load data into the table it is moved (not copied) from its current HDFS location to the Hive warehouse directory (/user/hive/warehouse//). So the data is not duplicated.
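The move-versus-copy behaviour of the managed case can be illustrated with a toy model on a local file system. This is only a sketch and not Hive itself: the helpers `create_managed_table` and `load_data_inpath` and the directory layout are hypothetical, and `shutil.move` merely stands in for the HDFS move that a `LOAD DATA INPATH` statement is described as performing.

```python
import shutil
import tempfile
from pathlib import Path


def create_managed_table(warehouse: Path, table: str) -> Path:
    """Toy stand-in for CREATE TABLE: an empty directory named after the table."""
    table_dir = warehouse / table
    table_dir.mkdir(parents=True)
    return table_dir


def load_data_inpath(source: Path, table_dir: Path) -> Path:
    """Toy stand-in for LOAD DATA INPATH: move (not copy) the file into the table directory."""
    return Path(shutil.move(str(source), str(table_dir / source.name)))


# Hypothetical layout: a staging area and a warehouse under one temporary root.
root = Path(tempfile.mkdtemp())
staging = root / "staging"
staging.mkdir()
source = staging / "test_emp_feedback.csv"
source.write_text("1,alice,good\n2,bob,ok\n")

table_dir = create_managed_table(root / "warehouse", "emp_feedback")
files_before_load = list(table_dir.iterdir())  # the table directory starts out empty
loaded = load_data_inpath(source, table_dir)

print("files before load:", files_before_load)
print("source still exists:", source.exists())
print("file in table directory:", loaded.name)
```

After the load there is exactly one copy of the file, now inside the table directory, which is why the managed case does not introduce redundancy.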

Caution: It is advisable to create an external table unless the data is used only by Hive. Dropping a managed table deletes its data from HDFS (the Hive warehouse).
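The difference on drop can be sketched with the same kind of local-filesystem toy model. Again, this is an illustration under stated assumptions rather than Hive's implementation: `drop_table` is a hypothetical helper, the paths are made up, and the metastore is not modelled at all.

```python
import shutil
import tempfile
from pathlib import Path


def drop_table(table_dir: Path, external: bool) -> None:
    """Toy stand-in for DROP TABLE: only a managed table's data directory is deleted."""
    if not external:
        shutil.rmtree(table_dir)
    # For an external table, only the metadata would be removed; the files stay put.


# Hypothetical layout: one managed table in the warehouse, one external table elsewhere.
root = Path(tempfile.mkdtemp())
managed_dir = root / "warehouse" / "emp_feedback"
external_dir = root / "shared" / "emp_feedback"
for directory in (managed_dir, external_dir):
    directory.mkdir(parents=True)
    (directory / "test_emp_feedback.csv").write_text("1,alice,good\n")

drop_table(managed_dir, external=False)
drop_table(external_dir, external=True)

managed_data_survives = managed_dir.exists()
external_data_survives = (external_dir / "test_emp_feedback.csv").exists()

print("managed data survives drop:", managed_data_survives)
print("external data survives drop:", external_data_survives)
```

In this model the managed table's files are gone after the drop while the external table's files remain available to other tools, which is the reason behind the caution above.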


Mufaddal Kamdar answered Sep 19 '25 05:09
