Table Partitioned by Timestamp Field

Tags:

hive

To generate some summary figures, we periodically import data into Hive. The data currently arrives as CSV files with the following layout:

operation,item,timestamp,user,marketingkey

We currently have a few queries that group by the date (yyyy-MM-dd) of the timestamp field.

The imported files sometimes contain data for several days, and I would like to store it partitioned by day. Is there a way to do this with Hive? I have built the table with the following DDL:

CREATE TABLE 
   partitionedTable (name string) 
PARTITIONED BY (time string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

The data was loaded like this:

LOAD DATA LOCAL INPATH 
   '/home/spaeth/tmp/hadoop-billing-data/extracted/testData.csv' 
INTO TABLE partitionedTable PARTITION(time='2013-05-01');

But I would like Hive to apply the partitioning automatically, based on a field in the file being imported. For example:

login,1,1370793184,user1,none --> stored to partition 2013-06-09
login,2,1360793184,user1,none --> stored to partition 2013-02-13
login,1,1360571184,user2,none --> stored to partition 2013-02-11
buy,2,1360501184,user2,key1   --> stored to partition 2013-02-10

asked Nov 25 '13 12:11

Francisco Spaeth

1 Answer

It seems like you are looking for dynamic partitioning, and Hive supports dynamic partition inserts as detailed in this article.

First, you need to create a staging table, with no partitions at all, to hold your flat data. In your case this would be:

CREATE TABLE 
    flatTable (type string, id int, ts bigint, user string, key string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Then, you should load your flat data file into this table:

LOAD DATA LOCAL INPATH
    '/home/spaeth/tmp/hadoop-billing-data/extracted/testData.csv'
INTO TABLE flatTable;
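Before running the insert, it can help to preview which partition each row would end up in. The query below is only a sketch against the flatTable defined above; the alias partition_value is just an illustrative name. Keep in mind that from_unixtime formats the timestamp in the system time zone of the cluster, so rows close to midnight may land on a different day than you expect.

```sql
-- Preview the partition value derived from each epoch timestamp.
-- partition_value is an illustrative alias, not a column of the table.
SELECT
    ts,
    from_unixtime(ts, 'yyyy-MM-dd') AS partition_value
FROM flatTable
LIMIT 10;
```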

At that point you can use the dynamic partition insert. Keep in mind that you'll need the following properties:

hive.exec.dynamic.partition should be set to true, because dynamic partitioning is disabled by default in older Hive versions (I believe in releases before 0.9).
hive.exec.dynamic.partition.mode should be set to nonstrict, because the only partition column here is dynamic and strict mode requires at least one static partition.
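Two related limits may also matter if a single file spans many days: hive.exec.max.dynamic.partitions (total partitions one statement may create, 1000 by default as far as I know) and hive.exec.max.dynamic.partitions.pernode (per mapper or reducer, 100 by default). The values below are arbitrary examples, not recommendations, and would go next to the other SET statements:

```sql
-- Example values only; raise them if the insert fails because
-- it tries to create more dynamic partitions than allowed.
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.dynamic.partitions.pernode=500;
```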

So you can run the following query:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
FROM
    flatTable
INSERT OVERWRITE TABLE
    partitionedTable
PARTITION(time)
SELECT
    user, from_unixtime(ts, 'yyyy-MM-dd') AS time;

This should spawn 2 MapReduce jobs, and at the end you should see something along the lines of:

Loading data to table default.partitionedtable partition (time=null)
    Loading partition {time=2013-02-10}
    Loading partition {time=2013-02-11}
    Loading partition {time=2013-02-13}
    Loading partition {time=2013-06-09}

To verify that your partitions are indeed there:

$ hadoop fs -ls /user/hive/warehouse/partitionedTable/
Found 4 items
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-10
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-11
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-13
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-06-09
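The same check can be done from within Hive, and the partition column can then drive the date-based grouping from the question. This is a sketch that assumes the partition column holds yyyy-MM-dd strings (that is, it is declared as string); the date range and the alias events are made up for illustration. The backticks around time are there in case your Hive version treats it as a reserved word.

```sql
-- List the partitions Hive knows about for the table.
SHOW PARTITIONS partitionedTable;

-- Group by day; the filter on the partition column lets Hive
-- read only the matching partition directories.
SELECT `time`, count(*) AS events
FROM partitionedTable
WHERE `time` >= '2013-02-01' AND `time` < '2013-03-01'
GROUP BY `time`;
```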

Please note that dynamic partitions are only supported since Hive 0.6, so if you have an older version this is probably not going to work.
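Note that the query above copies only the user column, and that INSERT OVERWRITE replaces the contents of every partition it writes to. For periodic imports you may prefer something along these lines, which keeps all five CSV fields and appends instead. Treat it as a sketch: the table eventsByDay and its column names are hypothetical, and INSERT INTO needs a newer Hive than dynamic partitioning does (0.8 or later, I believe).

```sql
-- Hypothetical table keeping all five CSV fields; names are illustrative.
CREATE TABLE eventsByDay (
    operation string,
    item int,
    ts bigint,
    user_name string,
    marketing_key string)
PARTITIONED BY (event_date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- INSERT INTO appends to existing partitions instead of replacing them.
-- The dynamic partition column must come last in the SELECT list.
-- Backticks guard against flatTable column names being reserved words.
INSERT INTO TABLE eventsByDay PARTITION (event_date)
SELECT
    `type`, id, ts, `user`, `key`,
    from_unixtime(ts, 'yyyy-MM-dd') AS event_date
FROM flatTable;
```

If you go this way, the staging table has to be emptied between imports (for example by loading with LOAD DATA ... OVERWRITE INTO TABLE flatTable), otherwise rows from earlier files would be appended again.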

answered Sep 22 '22 21:09

Charles Menguy
