automatically partition Hive tables based on S3 directory names

Tags:

amazon-s3

hive

I have data stored in S3 like:

/bucket/date=20140701/file1
/bucket/date=20140701/file2
...
/bucket/date=20140701/fileN

/bucket/date=20140702/file1
/bucket/date=20140702/file2
...
/bucket/date=20140702/fileN
...

My understanding is that if I pull in that data via Hive, it will automatically interpret date as a partition. My table creation looks like:

CREATE EXTERNAL TABLE search_input(
   col1 STRING,
   col2 STRING,
   ...
)
PARTITIONED BY(date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/';

However, Hive doesn't recognize any data: every query I run returns 0 results. If I instead point a table at just one of the dates via:

CREATE EXTERNAL TABLE search_input_20140701(
   col1 STRING,
   col2 STRING,
   ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/date=20140701';

I can query data just fine.

Why doesn't Hive recognize the nested directories with the "date=date_str" partition? Is there a better way to have Hive run a query over multiple sub-directories and slice it based on a datetime string?

asked Aug 04 '14 20:08

gallamine

1 Answer

In order to get this to work I had to do 2 things:

Enable recursive directory support:

SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

Hive still did not recognize my partitions, since creating an external table does not register partition directories that already exist under its location, so I had to recover them via:

ALTER TABLE search_input RECOVER PARTITIONS;
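
As far as I know, RECOVER PARTITIONS is an Amazon EMR extension to Hive. On stock Apache Hive the usual equivalent is MSCK REPAIR TABLE, and on either one a partition can be registered by hand. A minimal sketch, assuming the table and bucket layout from the question (the backticks around date are an assumption for newer Hive releases, where DATE is a reserved word):

```sql
-- Apache Hive: scan the table location and register any date=... directories
-- that are missing from the metastore.
MSCK REPAIR TABLE search_input;

-- Alternatively, register one partition explicitly.
-- The location mirrors the question's layout; adjust it to your bucket.
ALTER TABLE search_input ADD IF NOT EXISTS
  PARTITION (`date` = '20140701')
  LOCATION 's3n://bucket/date=20140701/';
```

Adding partitions explicitly can be preferable when new date directories arrive on a schedule, since it avoids rescanning the whole bucket each time.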

You can use:

SHOW PARTITIONS search_input;

to check and see that they've been recovered.
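
Once the partitions are registered, a query can be sliced by the date string, which is what the question was after. A rough sketch, assuming the yyyyMMdd strings from the question (they compare correctly as plain strings) and backtick-quoting date in case it is reserved in your Hive version:

```sql
-- Filtering on the partition column lets Hive read only the matching
-- date=... directories instead of the whole bucket.
SELECT COUNT(*)
FROM search_input
WHERE `date` BETWEEN '20140701' AND '20140702';
```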

answered Nov 15 '22 10:11

gallamine
