Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Table Partitioned by Timestamp Field

Tags:

hadoop

hive

In order to generate some summary figures we are importing data periodically to Hive. We are currently using a CSV file format and its layout is as follows:

operation,item,timestamp,user,marketingkey

Currently we have a few queries that are performing grouping over date (yyyy-mm-dd) of timestamp field.

The files that are being imported are holding sometimes more days and I would like to store it in a partitioned way. Is there a way to do it with Hive, I have build the table based on the following DDL:

CREATE TABLE 
   partitionedTable (name string) 
PARTITIONED BY (time bigint) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

The data loading was done like:

LOAD DATA LOCAL INPATH 
   '/home/spaeth/tmp/hadoop-billing-data/extracted/testData.csv' 
INTO TABLE partitionedTable PARTITION(time='2013-05-01');

But I would like that hive applies the partitioning in an automatic way based on a field that comes within the file that is being imported. For example:

login,1,1370793184,user1,none --> stored to partition 2013-06-09
login,2,1360793184,user1,none --> stored to partition 2013-02-13
login,1,1360571184,user2,none --> stored to partition 2013-02-11
buy,2,1360501184,user2,key1   --> stored to partition 2013-02-10
like image 488
Francisco Spaeth Avatar asked Nov 25 '13 12:11

Francisco Spaeth


People also ask

What is time based partitioning?

Time-based partitions Time partitioning is a convenient way to add a retention policy to persistent data. Instead of purging old rows based on a timestamp, which takes time, resources, and planning to avoid impacting regular database activity, the data is partitioned in multiple tables, a.k.a shards.

Can you partition by date in SQL?

You can partition on DateCreated column after you make it a clustered index.

What does it mean when a table is partitioned?

A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.

Can you index a partitioned table?

Partitioned tables can have indexes that are nonpartitioned (existing in a single table space within a database partition), indexes that are themselves partitioned across one or more table spaces within a database partition, or a combination of the two.


1 Answers

It seems like you are looking for dynamic partitioning, and Hive supports dynamic partition inserts as detailed in this article.

First, you need to create a temporary table where you will put your flat data with no partition at all. In your case this would be:

CREATE TABLE 
    flatTable (type string, id int, ts bigint, user string, key string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Then, you should load your flat data file into this directory:

LOAD DATA LOCAL INPATH
    '/home/spaeth/tmp/hadoop-billing-data/extracted/testData.csv'
INTO TABLE flatTable;

At that point you can use the dynamic partition insert. A few things to keep in mind are that you'll need the following properties:

  • hive.exec.dynamic.partition should be set to true because dynamic partition is disabled by default I believe.
  • hive.exec.dynamic.partition.mode should be set to nonstrict because you have a single partition and strict mode enforces that you need one static partition.

So you can run the following query:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
FROM
    flatTable
INSERT OVERWRITE TABLE
    partitionedTable
PARTITION(time)
SELECT
    user, from_unixtime(ts, 'yyyy-MM-dd') AS time

This should spawn 2 MapReduce jobs, and at the end you should see something along the lines of:

Loading data to table default.partitionedtable partition (time=null)
    Loading partition {time=2013-02-10}
    Loading partition {time=2013-02-11}
    Loading partition {time=2013-02-13}
    Loading partition {time=2013-06-09}

And to verify that your partitions are indeed here:

$ hadoop fs -ls /user/hive/warehouse/partitionedTable/
Found 4 items
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-10
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-11
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-13
drwxr-xr-x   - username supergroup          0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-06-09

Please note that dynamic partitions are only supported since Hive 0.6, so if you have an older version this is probably not going to work.

like image 179
Charles Menguy Avatar answered Sep 22 '22 21:09

Charles Menguy