In order to generate some summary figures we are importing data periodically to Hive. We are currently using a CSV file format and its layout is as follows:
operation,item,timestamp,user,marketingkey
Currently we have a few queries that are performing grouping over date (yyyy-mm-dd) of timestamp field.
The files that are being imported are holding sometimes more days and I would like to store it in a partitioned way. Is there a way to do it with Hive, I have build the table based on the following DDL:
CREATE TABLE
partitionedTable (name string)
PARTITIONED BY (time bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
The data loading was done like:
LOAD DATA LOCAL INPATH
'/home/spaeth/tmp/hadoop-billing-data/extracted/testData.csv'
INTO TABLE partitionedTable PARTITION(time='2013-05-01');
But I would like that hive applies the partitioning in an automatic way based on a field that comes within the file that is being imported. For example:
login,1,1370793184,user1,none --> stored to partition 2013-06-09
login,2,1360793184,user1,none --> stored to partition 2013-02-13
login,1,1360571184,user2,none --> stored to partition 2013-02-11
buy,2,1360501184,user2,key1 --> stored to partition 2013-02-10
Time-based partitions Time partitioning is a convenient way to add a retention policy to persistent data. Instead of purging old rows based on a timestamp, which takes time, resources, and planning to avoid impacting regular database activity, the data is partitioned in multiple tables, a.k.a shards.
You can partition on DateCreated column after you make it a clustered index.
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.
Partitioned tables can have indexes that are nonpartitioned (existing in a single table space within a database partition), indexes that are themselves partitioned across one or more table spaces within a database partition, or a combination of the two.
It seems like you are looking for dynamic partitioning, and Hive supports dynamic partition inserts as detailed in this article.
First, you need to create a temporary table where you will put your flat data with no partition at all. In your case this would be:
CREATE TABLE
flatTable (type string, id int, ts bigint, user string, key string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Then, you should load your flat data file into this directory:
LOAD DATA LOCAL INPATH
'/home/spaeth/tmp/hadoop-billing-data/extracted/testData.csv'
INTO TABLE flatTable;
At that point you can use the dynamic partition insert. A few things to keep in mind are that you'll need the following properties:
hive.exec.dynamic.partition
should be set to true
because dynamic partition is disabled by default I believe.hive.exec.dynamic.partition.mode
should be set to nonstrict
because you have a single partition and strict mode enforces that you need one static partition.So you can run the following query:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
FROM
flatTable
INSERT OVERWRITE TABLE
partitionedTable
PARTITION(time)
SELECT
user, from_unixtime(ts, 'yyyy-MM-dd') AS time
This should spawn 2 MapReduce jobs, and at the end you should see something along the lines of:
Loading data to table default.partitionedtable partition (time=null)
Loading partition {time=2013-02-10}
Loading partition {time=2013-02-11}
Loading partition {time=2013-02-13}
Loading partition {time=2013-06-09}
And to verify that your partitions are indeed here:
$ hadoop fs -ls /user/hive/warehouse/partitionedTable/
Found 4 items
drwxr-xr-x - username supergroup 0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-10
drwxr-xr-x - username supergroup 0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-11
drwxr-xr-x - username supergroup 0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-02-13
drwxr-xr-x - username supergroup 0 2013-11-25 18:35 /user/hive/warehouse/partitionedTable/time=2013-06-09
Please note that dynamic partitions are only supported since Hive 0.6, so if you have an older version this is probably not going to work.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With