SOLVED: See Update #2 below for the 'solution' to this issue.
~~~~~~~
In s3, I have some log*.gz files stored in a nested directory structure like:
s3://($BUCKET)/y=2012/m=11/d=09/H=10/
I'm attempting to load these into Hive on Elastic Map Reduce (EMR), using a multi-level partition spec like:
create external table logs (content string)
partitioned by (y string, m string, d string, h string)
location 's3://($BUCKET)';
Creation of the table works. I then attempt to recover all of the existing partitions:
alter table logs recover partitions;
This seems to work and it does drill down through my s3 structure and add all the various levels of directories:
hive> show partitions logs;
OK
y=2012/m=11/d=06/h=08
y=2012/m=11/d=06/h=09
y=2012/m=11/d=06/h=10
y=2012/m=11/d=06/h=11
y=2012/m=11/d=06/h=12
y=2012/m=11/d=06/h=13
y=2012/m=11/d=06/h=14
y=2012/m=11/d=06/h=15
y=2012/m=11/d=06/h=16
...
So it seems that Hive can see and interpret my file layout successfully. However, no actual data ever gets loaded. If I try to do a simple count or select *, I get nothing:
hive> select count(*) from logs;
...
OK
0
hive> select * from logs limit 10;
OK
hive> select * from logs where y = '2012' and m = '11' and d = '06' and h='16' limit 10;
OK
Thoughts? Am I missing some additional command to load data beyond recovering the partitions?
If I manually add a partition with an explicit location, then that works:
alter table logs2 add partition (y='2012', m='11', d='09', h='10') location 's3://($BUCKET)/y=2012/m=11/d=09/H=10/'
I can just write a script to do this, but it feels like I'm missing something fundamental w.r.t 'recover partitions'.
UPDATE #1
Thanks to a brilliant and keen observation by Joe K in a comment below, I think that case sensitivity issues might be involved here.
The files are definitely organized like the following path spec, with a capitalized H (I think this might be some nod to iso8601 formatting):
s3://($BUCKET)/y=2012/m=11/d=09/H=10/
I create my external table with a partition spec that does the proper capitalization:
partitioned by (y string, m string, d string, H string)
(Notice the 'H'). I do a recover partitions, which does seem to recurse through the directories and find the partitions appropriately, but somehow (despite using 'H' in all instructive places so far), it indeed seems that Hive saves it as a lower case 'h':
hive> show partitions logs;
OK
y=2012/m=11/d=06/h=08
(Note the 'h'). So it seems that Hive is able to discover the partitions, but then stores them in a lowercase form ... Later when it goes to look for data, these paths are (of course) empty because S3 is case sensitive.
I am going to move my data into an all-lowercase directory structure and see if that works...
UPDATE #2
Indeed, I have confirmed that the capitalized 'H' as a partition name (in the s3 file layout) was the problem here. As far as I can tell, this is what was happening:
The 'recover partitions' command is an extension of Hive authored by Amazon. I strongly suspect that the bug is in this component. To my knowledge native Hive has no concept of exploring a file root for partition discovery...
ALTER TABLE RECOVER PARTITIONS is the command that is widely used in Hive to refresh partitions as new partitions are directly added to the file system by other users. Qubole has added path validation to the ALTER TABLE RECOVER PARTITIONS command. The command only recovers valid paths.
Even when reading a file from an S3 bucket, Spark (by default) creates one partition per block i.e. total no of partitions = total-file-size / block-size.
Having partitions in Amazon S3 helps with Athena query performance, because this helps you run targeted queries for only specific partitions. Athena then scans only those partitions, saving you query costs and query time. For information about partitioning syntax, search for partitioned_by in CREATE TABLE AS.
You need to synchronize the metastore and the file system. You can refresh Hive metastore partition information manually or automatically. You run the MSCK (metastore consistency check) Hive command: MSCK REPAIR TABLE table_name SYNC PARTITIONS every time you need to synchronize a partition with your file system.
It's a case issue on the hour field!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With