
How to point one Hive Table to Multiple External Files?

Tags:

hadoop

hive

hdfs

I would like to append multiple HDFS files to one Hive table while leaving the files in their original locations. The files are created in different directories.

The LOAD DATA INPATH command moves the HDFS file into the Hive warehouse directory.
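
For example (a minimal sketch; the table name and path here are hypothetical):

LOAD DATA INPATH '/some/hdfs/path/file.csv' INTO TABLE sometable;

After this statement runs, the file no longer exists at /some/hdfs/path/file.csv; Hive has moved it under the table's warehouse directory.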

As far as I can tell, an external table must point to a single file, or to a single directory containing multiple files with the same schema. However, my files are not located under a single directory.

Is it possible to point a single Hive table to multiple external files in separate directories, or to otherwise copy multiple files into a single Hive table without moving them from their original HDFS locations?

Expanded solution based on Pradeep's answer:

For example, my files look like this:

/root_directory/<job_id>/input/<dt>

Pretend the schema of each is (foo STRING, bar STRING, job_id STRING, dt STRING)

I first create an external table. Note that the DDL does not specify an initial location, and that it does not include the job_id and dt fields in the column list; they are declared as partition columns instead:

CREATE EXTERNAL TABLE hivetest (
    foo STRING,
    bar STRING
)
PARTITIONED BY (job_id STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
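
At this point the table exists but contains no data. A quick sanity check (assuming the Hive CLI or Beeline):

SHOW PARTITIONS hivetest;

This returns an empty list until partitions are added below.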

Let's say I have two files I wish to insert located at:

/root_directory/b1/input/2014-01-01
/root_directory/b2/input/2014-01-02

I can load these two external files into the same Hive table like so:

ALTER TABLE hivetest 
ADD PARTITION(job_id = 'b1', dt='2014-01-01')
LOCATION '/root_directory/b1/input/2014-01-01';

ALTER TABLE hivetest 
ADD PARTITION(job_id = 'b2', dt='2014-01-02')
LOCATION '/root_directory/b2/input/2014-01-02';
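
Once the partitions are added, the data is queryable in place and the files remain at their original paths. For example, a query like the following (illustrative only) would read just the files under /root_directory/b1/input/2014-01-01, thanks to partition pruning:

SELECT foo, bar FROM hivetest WHERE job_id = 'b1' AND dt = '2014-01-01';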

If anyone happens to require the use of Talend to perform this, they can use the tHiveLoad component like so [edit: this doesn't work; see below]:

[Screenshot: tHiveLoad component configuration]

The code Talend produces for this using tHiveLoad is actually LOAD DATA INPATH ..., which moves the file out of its original location in HDFS.

You will have to run the earlier ALTER TABLE statements in a tHiveLoad instead.

asked Mar 17 '23 by Matthew Moisen


1 Answer

The short answer is yes. A Hive External Table can be pointed to multiple files/directories. The long answer will depend on the directory structure of your data. The typical way you do this is to create a partitioned table with the partition columns mapping to some part of your directory path.

E.g. we have a use case where an external table points to thousands of directories on HDFS. Our paths conform to the pattern /prod/${customer-id}/${date}/. Each of these directories contains approximately 100 files. To map this into a Hive table, we created two partition columns, customer_id and dt. So every day, we're able to load the data into Hive by doing:

ALTER TABLE x ADD PARTITION (customer_id = "blah", dt = "blah_date") LOCATION '/prod/blah/blah_date';
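
Note that because the table is external, dropping a partition later removes only the Hive metadata; the underlying files under /prod/blah/blah_date are left in place:

ALTER TABLE x DROP PARTITION (customer_id = "blah", dt = "blah_date");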
answered Mar 24 '23 by Pradeep Gollakota