SOLVED: See Update #2 below for the 'solution' to this issue. ~~~~~~~ In s3, I have some log*.gz files stored in a nested directory structure like: <pre class="prettyprint"><code>s3://($BUCKET)/y=2012/m=11/d=09/H=10/ </code></pre> I'm attempting to load these into Hive on Elastic Map Reduce (EMR), using a multi-level partition spec like: <pre class="prettyprint"><code>create external table logs (content string) partitioned by (y string, m string, d string, h string) location 's3://($BUCKET)'; </code></pre> Creation of the table works. I then attempt to recover all of the existing partitions: <pre class="prettyprint"><code>alter table logs recover partitions; </code></pre> This seems to work and it does drill down through my s3 structure and add all the various levels of directories: <pre class="prettyprint"><code>hive> show partitions logs; OK y=2012/m=11/d=06/h=08 y=2012/m=11/d=06/h=09 y=2012/m=11/d=06/h=10 y=2012/m=11/d=06/h=11 y=2012/m=11/d=06/h=12 y=2012/m=11/d=06/h=13 y=2012/m=11/d=06/h=14 y=2012/m=11/d=06/h=15 y=2012/m=11/d=06/h=16 ... </code></pre> So it seems that Hive can see and interpret my file layout successfully. However, no actual data ever gets loaded. If I try to do a simple count or select *, I get nothing: <pre class="prettyprint"><code>hive> select count(*) from logs; ... OK 0 hive> select * from logs limit 10; OK hive> select * from logs where y = '2012' and m = '11' and d = '06' and h='16' limit 10; OK </code></pre> Thoughts? Am I missing some additional command to load data beyond recovering the partitions? If I manually add a partition with an explicit location, then that works: <pre class="prettyprint"><code>alter table logs2 add partition (y='2012', m='11', d='09', h='10') location 's3://($BUCKET)/y=2012/m=11/d=09/H=10/' </code></pre> I can just write a script to do this, but it feels like I'm missing something fundamental w.r.t 'recover partitions'. UPDATE #1 Thanks to a brilliant and keen observation by Joe K in a comment below, I think that case sensitivity issues might be involved here. The files are definitely organized like the following path spec, with a capitalized H (I think this might be some nod to iso8601 formatting): <pre class="prettyprint"><code>s3://($BUCKET)/y=2012/m=11/d=09/H=10/ </code></pre> I create my external table with a partition spec that does the proper capitalization: <pre class="prettyprint"><code>partitioned by (y string, m string, d string, H string) </code></pre> (Notice the 'H'). I do a recover partitions, which does seem to recurse through the directories and find the partitions appropriately, but somehow (despite using 'H' in all instructive places so far), it indeed seems that Hive saves it as a lower case 'h': <pre class="prettyprint"><code>hive> show partitions logs; OK y=2012/m=11/d=06/h=08 </code></pre> (Note the 'h'). So it seems that Hive is able to discover the partitions, but then stores them in a lowercase form ... Later when it goes to look for data, these paths are (of course) empty because S3 is case sensitive. I am going to move my data into an all-lowercase directory structure and see if that works... UPDATE #2 Indeed, I have confirmed that the capitalized 'H' as a partition name (in the s3 file layout) was the problem here. As far as I can tell, this is what was happening: <ul> <li>My layout on S3 had a case-sensitive partition name (H=)</li> <li>Running RECOVER PARTITIONS correctly discovers these partitions...</li> <li>But then they are stored internally as lowercase (h)</li> </ul> The 'recover partitions' command is an extension of Hive authored by Amazon. I strongly suspect that the bug is in this component. To my knowledge native Hive has no concept of exploring a file root for partition discovery...

It's a case issue on the hour field!

Loading data with Hive, S3, EMR, and Recover Partitions

Tags:

elastic-map-reduce

SOLVED: See Update #2 below for the 'solution' to this issue.

~~~~~~~

In s3, I have some log*.gz files stored in a nested directory structure like:

s3://($BUCKET)/y=2012/m=11/d=09/H=10/

I'm attempting to load these into Hive on Elastic Map Reduce (EMR), using a multi-level partition spec like:

create external table logs (content string)
partitioned by (y string, m string, d string, h string)
location 's3://($BUCKET)';

Creation of the table works. I then attempt to recover all of the existing partitions:

alter table logs recover partitions;

This seems to work and it does drill down through my s3 structure and add all the various levels of directories:

hive> show partitions logs;
OK
y=2012/m=11/d=06/h=08
y=2012/m=11/d=06/h=09
y=2012/m=11/d=06/h=10
y=2012/m=11/d=06/h=11
y=2012/m=11/d=06/h=12
y=2012/m=11/d=06/h=13
y=2012/m=11/d=06/h=14
y=2012/m=11/d=06/h=15
y=2012/m=11/d=06/h=16
...

So it seems that Hive can see and interpret my file layout successfully. However, no actual data ever gets loaded. If I try to do a simple count or select *, I get nothing:

hive> select count(*) from logs;
...
OK
0

hive> select * from logs limit 10;
OK

hive> select * from logs where y = '2012' and m = '11' and d = '06' and h='16' limit 10;
OK

Thoughts? Am I missing some additional command to load data beyond recovering the partitions?

If I manually add a partition with an explicit location, then that works:

alter table logs2 add partition (y='2012', m='11', d='09', h='10') location 's3://($BUCKET)/y=2012/m=11/d=09/H=10/'

I can just write a script to do this, but it feels like I'm missing something fundamental w.r.t 'recover partitions'.

UPDATE #1

Thanks to a brilliant and keen observation by Joe K in a comment below, I think that case sensitivity issues might be involved here.

The files are definitely organized like the following path spec, with a capitalized H (I think this might be some nod to iso8601 formatting):

s3://($BUCKET)/y=2012/m=11/d=09/H=10/

I create my external table with a partition spec that does the proper capitalization:

partitioned by (y string, m string, d string, H string)

(Notice the 'H'). I do a recover partitions, which does seem to recurse through the directories and find the partitions appropriately, but somehow (despite using 'H' in all instructive places so far), it indeed seems that Hive saves it as a lower case 'h':

hive> show partitions logs;
OK
y=2012/m=11/d=06/h=08

(Note the 'h'). So it seems that Hive is able to discover the partitions, but then stores them in a lowercase form ... Later when it goes to look for data, these paths are (of course) empty because S3 is case sensitive.

I am going to move my data into an all-lowercase directory structure and see if that works...

UPDATE #2

Indeed, I have confirmed that the capitalized 'H' as a partition name (in the s3 file layout) was the problem here. As far as I can tell, this is what was happening:

My layout on S3 had a case-sensitive partition name (H=)
Running RECOVER PARTITIONS correctly discovers these partitions...
But then they are stored internally as lowercase (h)

The 'recover partitions' command is an extension of Hive authored by Amazon. I strongly suspect that the bug is in this component. To my knowledge native Hive has no concept of exploring a file root for partition discovery...

343

asked Nov 10 '12 03:11

Mike Repass

1 Answers

It's a case issue on the hour field!

170

answered Sep 21 '22 11:09

Joe K

Related questions
                            
                                how to design Hbase schema?
                            
                                Is there any distributed file system which runs on Windows except Hadoop? [closed]
                            
                                The node /hbase is not in ZooKeeper
                            
                                Big data signal analysis: better way to store and query signal data
                            
                                Classify data using Apache Mahout
                            
                                How would you suggest performing "Join" with Hadoop streaming?
                            
                                HBase ERROR: hbase-default.xml file seems to be for and old version of HBase (null)
                            
                                Bypassing org.apache.hadoop.mapred.InvalidInputException: Input Pattern s3n://[...] matches 0 files
                            
                                Hive create table with inputs from nested sub-directories
                            
                                In spark join, does table order matter like in pig?
                            
                                HBase: Thrift vs Rest performance
                            
                                Get error "mismatched input 'as' expecting FROM near ')' in from clause" when run sql query Hadoop Java
                            
                                How to decide on the number of partitions required for input data size and cluster resources?
                            
                                When to prefer Hadoop MapReduce over Spark?
                            
                                Run Apache Flink with Amazon S3
                            
                                Retrieve files from remote HDFS
                            
                                Number of reducers in hadoop
                            
                                Hadoop, Hive, Pig, HBase, Cassandra - when to use what? [closed]
                            
                                Hadoop slave files configuration
                            
                                MapReduce shuffle/sort method

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Loading data with Hive, S3, EMR, and Recover Partitions

Tags:

amazon-web-services

amazon-s3

hadoop

hive

elastic-map-reduce

Mike Repass

People also ask

1 Answers

Joe K

Recent Activity

Donate For Us