
automatically partition Hive tables based on S3 directory names

Tags: amazon-s3, hive

I have data stored in S3 like:

/bucket/date=20140701/file1
/bucket/date=20140701/file2
...
/bucket/date=20140701/fileN

/bucket/date=20140702/file1
/bucket/date=20140702/file2
...
/bucket/date=20140702/fileN
...

My understanding is that if I pull in that data via Hive, it will automatically interpret date as a partition. My table creation looks like:

CREATE EXTERNAL TABLE search_input(
   col1 STRING,
   col2 STRING,
   ...
)
PARTITIONED BY(date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/';

However, Hive doesn't recognize any data: every query I run returns 0 results. If I instead grab just one of the dates via:

CREATE EXTERNAL TABLE search_input_20140701(
   col1 STRING,
   col2 STRING,
   ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/date=20140701';

I can query data just fine.

Why doesn't Hive recognize the nested directories with the "date=date_str" partition? Is there a better way to have Hive run a query over multiple sub-directories and slice it based on a datetime string?

gallamine asked Aug 04 '14 20:08



1 Answer

To get this to work I had to do two things:

  1. Enable recursive directory support:
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
  2. Hive still did not recognize my partitions, so I had to recover them via:
ALTER TABLE search_input RECOVER PARTITIONS;
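
A note on why the recovery step is needed: Hive does not scan an external table's location for partition directories on its own. Each partition has to be registered in the metastore before queries see its data. RECOVER PARTITIONS is, as far as I know, specific to the Hive build on Amazon EMR. Below is a minimal sketch of the stock Apache Hive equivalents, reusing the table and bucket names from the question. The backquotes around date are an assumption for newer Hive versions, where date is a reserved keyword.

```sql
-- Option A: scan the table LOCATION for key=value directories
-- (e.g. date=20140701) and register any partitions that are
-- missing from the metastore.
MSCK REPAIR TABLE search_input;

-- Option B: register a single partition explicitly.
-- `date` is backquoted on the assumption that it is a reserved
-- keyword in your Hive version; older releases accept it unquoted.
ALTER TABLE search_input ADD IF NOT EXISTS
  PARTITION (`date` = '20140701')
  LOCATION 's3n://bucket/date=20140701';
```

Option A is convenient when many date= directories already exist. Option B suits a daily job that adds one new directory at a time.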

You can use:

SHOW PARTITIONS search_input;

to check that they've been recovered.
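
Once the partitions are listed, filtering on the partition column should let Hive read only the matching date= directories, which covers the "slice it based on a datetime string" part of the question. A small sketch, assuming the yyyymmdd strings shown in the question (they sort lexicographically in date order); n_rows is just an illustrative alias:

```sql
-- Count rows per day over a range of partitions. With partition
-- pruning, only the date=20140701 and date=20140702 directories
-- need to be read.
SELECT `date`, COUNT(*) AS n_rows
FROM search_input
WHERE `date` BETWEEN '20140701' AND '20140702'
GROUP BY `date`;
```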

gallamine answered Nov 15 '22 10:11