
Hadoop - Create external table from multiple directories in HDFS

I have an external table in Hive that reads all the files from the HDFS location (/user/hive/warehouse/tableX).

Now, let's assume there is some pre-partitioning of the data and all the previous files are split into several directories following a specific naming convention <dir_name>_<incNumber>, e.g.

/user/hive/warehouse/split/
  ./dir_1/files...
  ./dir_2/files...
  ./dir_n/files...

How can I create another external table that keeps track of all the files in the split folder?

Do I need to create an external table that is partitioned on each sub-folder (dir_x)?

Also, is some kind of Hive or shell script needed to create/add a partition for each sub-directory?

Asked by Alg_D on Jun 01 '16


People also ask

Can we have bucketing for external tables?

Yes, Hive does support bucketing and partitioning for external tables.
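
For illustration only (the table and column names here are hypothetical, not taken from the question), an external table can be both partitioned and bucketed:

CREATE EXTERNAL TABLE events( user_id string,
                              payload string)
PARTITIONED BY (event_date string)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC
LOCATION '/user/hive/warehouse/events';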

What is external table in HDFS?

You use an external table, a table that Hive does not manage, to read data from files that already exist on the file system. In contrast to a Hive managed table, an external table's data stays where it is on HDFS and is not deleted when the table is dropped; the Hive metastore stores only the schema metadata of the external table.
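
A small illustrative sketch (the path and table name are made up): dropping an external table removes only the metadata, while the underlying files stay on HDFS.

CREATE EXTERNAL TABLE staging_logs (line string)
LOCATION '/data/staging/logs';

DROP TABLE staging_logs;  -- the files under /data/staging/logs remain on HDFS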


2 Answers

You have to create an external table partitioned by dir_x to access all files in multiple folders.

CREATE EXTERNAL TABLE sample_table( col1 string,
                                    col2 string,
                                    col3 string,
                                    col4 string)
PARTITIONED BY (dir string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/split';

Then add the partitions as you would for a regular partitioned table:

ALTER TABLE sample_table ADD PARTITION(dir='dir_1')
LOCATION '/user/hive/warehouse/split/dir_1';
ALTER TABLE sample_table ADD PARTITION(dir='dir_2')
LOCATION '/user/hive/warehouse/split/dir_2';

This approach will work, but there is one issue with it: if at some time in the future you add a new folder (e.g. dir_100) under the warehouse path, you will have to register it with another ALTER TABLE ... ADD PARTITION statement before its files show up in sample_table, as sketched below. I haven't worked with Hive for about 10 months now, so I am not sure if there is a better approach. If this is not an issue, you can use this approach.
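
A minimal sketch of registering such a later-arriving directory (dir_100 is just the hypothetical name used above):

ALTER TABLE sample_table ADD PARTITION(dir='dir_100')
LOCATION '/user/hive/warehouse/split/dir_100';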

Answered by dheee on Sep 26 '22


Once you have your external table in place and you know the partition structure that Hive would create, you can load the data files into the pre-partitioned folders on HDFS through some process. Once the pre-partitioned files are available (following the partition structure Hive expects), you can repair the table as below.

hive> MSCK REPAIR TABLE sample_table; 

The above statement will automatically sync all of the existing partition directories into the Hive metastore for the existing EXTERNAL TABLE sample_table.
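
One caveat worth noting (this is an assumption about how the files get laid out, not something stated in the answer): MSCK REPAIR TABLE only discovers directories that follow Hive's partition_column=value naming convention. For sample_table that would look like the layout below; the part-00000 file names are just placeholders.

/user/hive/warehouse/split/dir=dir_1/part-00000
/user/hive/warehouse/split/dir=dir_2/part-00000

hive> MSCK REPAIR TABLE sample_table;
hive> SHOW PARTITIONS sample_table;   -- should now list dir=dir_1, dir=dir_2, ...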

Answered by Srisatya on Sep 22 '22