Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive create table with inputs from nested sub-directories

I have data in Avro format in HDFS in file paths like: /data/logs/[foldername]/[filename].avro. I want to create a Hive table over all these log files, i.e. all files of the form /data/logs/*/*. (They're all based on the same Avro schema.)

I'm running the below query with flag mapred.input.dir.recursive=true:

CREATE EXTERNAL TABLE default.testtable
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION 'hdfs://.../data/*/*'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs://.../schema.avsc') 

The table ends up being empty unless I change LOCATION to be less nested, i.e. to be 'hdfs://.../data/[foldername]/' with a certain foldername. This worked no-problem with a less nested path for LOCATION.

I'd like to be able to source data from all these different [foldername] folders. How do I make the recursive input selection go further in my nested directories?

like image 581
Maxim Zaslavsky Avatar asked Jun 26 '14 18:06

Maxim Zaslavsky


2 Answers

Use this Hive settings to enable recursive directories:

set hive.mapred.supports.subdirectories=TRUE;
set mapred.input.dir.recursive=TRUE;

Create external table and specify root directory as a location:

LOCATION 'hdfs://.../data'

You will be able to query data from table location and all subdirectories

like image 175
leftjoin Avatar answered Nov 02 '22 17:11

leftjoin


One thing that would solve your problem is adding the folder name as a partition column to the external table. Then you can create the table as you're creating just on the data directory. Or you can take these nested files and flatten them in a single directory.

I don't think you'll be able to ask hive to have input of all these folders considered as 1 table otherwise.

This questions seems to be addressing a similar issue: when creating an external table in hive can I point the location to specific files in a direcotry?

There is an open jira issue on the same context: https://issues.apache.org/jira/browse/HIVE-951

Browsing more I saw this post suggesting you use SimlinkInputTextFormat as an alternative. I am not sure how well this would fly with your Avro format. https://hive.apache.org/javadocs/r0.10.0/api/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.html

like image 37
Dhruv Kapur Avatar answered Nov 02 '22 18:11

Dhruv Kapur