How to add a partition in Hive for a specific date?

I'm using Hive (with external tables) to process data stored on Amazon S3.

My data is partitioned as follows:

                       DIR   s3://test.com/2014-03-01/
                       DIR   s3://test.com/2014-03-02/
                       DIR   s3://test.com/2014-03-03/
                       DIR   s3://test.com/2014-03-04/
                       DIR   s3://test.com/2014-03-05/

s3://test.com/2014-03-05/ip-foo-request-2014-03-05_04-20_00-49.log
s3://test.com/2014-03-05/ip-foo-request-2014-03-05_06-26_19-56.log
s3://test.com/2014-03-05/ip-foo-request-2014-03-05_15-20_12-53.log
s3://test.com/2014-03-05/ip-foo-request-2014-03-05_22-54_27-19.log

How do I create a partitioned table over this layout in Hive?

CREATE EXTERNAL TABLE test (
    foo string,
    time string,
    bar string
)  PARTITIONED BY (? string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';

Could somebody answer this question? Thanks!

asked Mar 06 '14 at 09:03 by Brisi


2 Answers

First, start with the right table definition. In your case I'll just use what you wrote, with dt as the partition column:

CREATE EXTERNAL TABLE test (
    foo string,
    time string,
    bar string
)  PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';

By default, Hive expects partitions to be in subdirectories named via the convention s3://test.com/partitionkey=partitionvalue. For example:

s3://test.com/dt=2014-03-05

If you follow this convention, you can use MSCK REPAIR TABLE to add all partitions at once.
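
As a minimal sketch, assuming the directories have already been renamed to the dt=YYYY-MM-DD form under the table location (the question's current layout does not do this), the repair and a quick check might look like:

```sql
-- Assumption: directories such as s3://test.com/dt=2014-03-05/ already exist.
-- Scan the table location and register partitions missing from the metastore.
MSCK REPAIR TABLE test;

-- List the partitions Hive now knows about.
SHOW PARTITIONS test;
```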

If you can't or don't want to use this naming convention, you will need to add each partition explicitly, as in:

ALTER TABLE test
    ADD PARTITION (dt='2014-03-05')
    LOCATION 's3://test.com/2014-03-05';
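
Several partitions can also be registered in a single statement. The following is only a sketch built from the five directories listed in the question. IF NOT EXISTS is an addition here so the statement can be rerun without failing on partitions that are already registered, and the final query is a hypothetical usage example:

```sql
-- One ADD clause can carry several PARTITION ... LOCATION pairs.
ALTER TABLE test ADD IF NOT EXISTS
    PARTITION (dt='2014-03-01') LOCATION 's3://test.com/2014-03-01'
    PARTITION (dt='2014-03-02') LOCATION 's3://test.com/2014-03-02'
    PARTITION (dt='2014-03-03') LOCATION 's3://test.com/2014-03-03'
    PARTITION (dt='2014-03-04') LOCATION 's3://test.com/2014-03-04'
    PARTITION (dt='2014-03-05') LOCATION 's3://test.com/2014-03-05';

-- Filtering on the partition column limits the scan to that date's directory.
SELECT foo, bar
FROM test
WHERE dt = '2014-03-05'
LIMIT 10;
```

New dates still have to be added the same way as their directories appear, since Hive will not discover directories that don't follow the key=value convention.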

answered Oct 07 '22 at 23:10 by Carter Shanklin


If you have an existing directory structure that doesn't follow the <partition name>=<partition value> convention, you have to add partitions manually. MSCK REPAIR TABLE won't work unless your directories are structured that way.

After you specify the location at table creation, like this:

CREATE EXTERNAL TABLE test (
    foo string,
    time string,
    bar string
)  
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';

You can add a partition without specifying the full path:

ALTER TABLE test ADD PARTITION (dt='2014-03-05') LOCATION '2014-03-05';
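
Since the path above is relative, it may be worth confirming where Hive actually resolved it. A small sketch of how that could be checked, assuming the partition was added as shown:

```sql
-- Confirm the partition is registered.
SHOW PARTITIONS test;

-- Inspect the partition's metadata; the Location field in the output
-- shows the full path Hive stored for it.
DESCRIBE FORMATTED test PARTITION (dt='2014-03-05');
```

If the stored location is not s3://test.com/2014-03-05, using the full path in the LOCATION clause is the safer choice.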

Although I've never verified this, I suggest moving your partitions into a folder inside the bucket rather than keeping them directly at the bucket root, e.g. from s3://test.com/ to s3://test.com/data/.

answered Oct 08 '22 at 01:10 by cakraww