Hive: Table creation with multi-files with multiple directories

Tags: hadoop, hive

I want to create a Hive table whose input text files are spread across multiple subdirectories in HDFS. For example, I have the following in HDFS:

    /testdata/user/Jan/part-0001
    /testdata/user/Feb/part-0001
    /testdata/user/Mar/part-0001
and so on...

If I want to create a users table in Hive and have it traverse the subdirectories of user, can that be done? I tried something like this, but it doesn't work:

CREATE EXTERNAL TABLE users (id int, name string) 
STORED AS TEXTFILE LOCATION '/testdata/user/*'  

I thought adding the wildcard would work, but it doesn't. Dropping the wildcard doesn't work either. However, if I copy the files into the root of the user directory, it works. Is there no way for Hive to descend into the child directories and pick up those files?

user706794 asked Jan 27 '12 20:01



1 Answer

You can create an external table, then add subfolders as partitions.

CREATE EXTERNAL TABLE test (id BIGINT) PARTITIONED BY ( yymmdd STRING);
ALTER TABLE test ADD PARTITION (yymmdd = '20120921') LOCATION 'loc1';
ALTER TABLE test ADD PARTITION (yymmdd = '20120922') LOCATION 'loc2';
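
Applied to the directory layout from the question, the same approach might look like the sketch below. The partition column name `month_name` is made up for illustration, and the sketch assumes the files use Hive's default TEXTFILE delimiters, since the question does not say otherwise.

```sql
-- External table rooted at the parent directory from the question.
-- `month_name` is a hypothetical partition column; it is not stored in the
-- data files, Hive derives it from the partition each file belongs to.
CREATE EXTERNAL TABLE users (id INT, name STRING)
PARTITIONED BY (month_name STRING)
STORED AS TEXTFILE
LOCATION '/testdata/user';

-- Register each subdirectory as one partition.
ALTER TABLE users ADD PARTITION (month_name = 'Jan') LOCATION '/testdata/user/Jan';
ALTER TABLE users ADD PARTITION (month_name = 'Feb') LOCATION '/testdata/user/Feb';
ALTER TABLE users ADD PARTITION (month_name = 'Mar') LOCATION '/testdata/user/Mar';

-- Reads every registered subdirectory.
SELECT id, name, month_name FROM users;

-- Filtering on the partition column lets Hive read only that subdirectory.
SELECT id, name FROM users WHERE month_name = 'Jan';
```

Each new subdirectory needs its own `ADD PARTITION` statement before its files show up in query results.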

Rufus answered Sep 19 '22 05:09