I want to create a Hive table where the input textfiles are traversed onto multiple sub-directories in hdfs. So example I have in hdfs:
/testdata/user/Jan/part-0001
/testdata/user/Feb/part-0001
/testdata/user/Mar/part-0001
and so on...
If i want to create a table user in hive, but have it be able to traverse the sub-directories of user, can that be done? I tried something like this, but doesn't work;
CREATE EXTERNAL TABLE users (id int, name string)
STORED AS TEXTFILE LOCATION '/testdata/user/*'
I thought adding the wildcard would work but doesn't. When I tried not using wildcard still does not work. However, if I copy the files into the root directory of user, then it works. Is there no way for Hive to traverse to the child-directories, and grab those files?
Yes, we can have multiple hive tables with the same underlying HDFS directory.
Limitations: Having large number of partitions create number of files/ directories in HDFS, which creates overhead for NameNode as it maintains metadata. It may optimize certain queries based on where clause, but may cause slow response for queries based on grouping clause.
Apache Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of particular columns like date, city, and department. Each table in the hive can have one or more partition keys to identify a particular partition.
You can create an external table, then add subfolders as partitions.
CREATE EXTERNAL TABLE test (id BIGINT) PARTITIONED BY ( yymmdd STRING);
ALTER TABLE test ADD PARTITION (yymmdd = '20120921') LOCATION 'loc1';
ALTER TABLE test ADD PARTITION (yymmdd = '20120922') LOCATION 'loc2';
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With