
How can I partition a hive table by (only) a portion of a timestamp column?

Assume I have a Hive table that includes a TIMESTAMP column that is frequently (almost always) included in the WHERE clauses of a query. It makes sense to partition this table by the TIMESTAMP field; however, to keep to a reasonable cardinality, it makes sense to partition by day (not by the maximum resolution of the TIMESTAMP).

What's the best way to achieve this? Should I create an additional column (DATE) and partition on that? Or is there a way to achieve the partition without creating a duplicate column?

Jared asked Feb 03 '17 00:02




1 Answer

It's not a new column but a pseudo-column. You should re-create your table, adding a partitioning specification like this:

create table table_name (
  id                int,
  name              string,
  `timestamp`       string
)
-- date and timestamp are reserved words in recent Hive versions, hence the backticks
partitioned by (`date` string);

Then load the data, creating the partitions dynamically, like this:

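set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- With dynamic partitioning, PARTITION names only the partition column;
-- its value is taken from the last column of the SELECT.
-- substring() positions are 1-based; (1, 10) yields the day, assuming 'yyyy-MM-dd ...' strings.
FROM table_name_old tno
INSERT OVERWRITE TABLE table_name PARTITION (`date`)
       SELECT tno.id, tno.name, tno.`timestamp`, substring(tno.`timestamp`, 1, 10);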
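
The question describes a real TIMESTAMP column, whereas the tables above store it as a string. As a rough sketch of that case, assuming table_name_old.`timestamp` is of type TIMESTAMP (an assumption, not something the answer states), the day can be derived with the built-in to_date() function instead of string slicing:

```sql
-- Assumption: table_name_old.`timestamp` is a TIMESTAMP, not a string.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE table_name PARTITION (`date`)
SELECT
  tno.id,
  tno.name,
  CAST(tno.`timestamp` AS STRING)          AS `timestamp`,  -- target column is a string
  CAST(to_date(tno.`timestamp`) AS STRING) AS `date`        -- 'yyyy-MM-dd'; partition value goes last
FROM table_name_old tno;
```

The explicit casts keep the example independent of the Hive version, since to_date() returns a string in older releases and a DATE in newer ones.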

Now if you select everything from your table, you will see an extra column for the partition. Keep in mind, though, that a Hive partition is just a subdirectory, not a real stored column, so it adds only a few kilobytes to the total table size.
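
Because each partition is a subdirectory, a query only skips the other days when it filters on the partition column itself. A condition on the timestamp column alone is not enough for that. The following is a minimal sketch of what that might look like; the date and time values are made up for illustration:

```sql
-- List the day-level partitions (one subdirectory each).
SHOW PARTITIONS table_name;

-- Hypothetical query for a four-hour window on one day.
SELECT id, name, `timestamp`
FROM table_name
WHERE `date` = '2017-02-03'                  -- partition filter: only this day's directory is read
  AND `timestamp` >= '2017-02-03 08:00:00'   -- ordinary row filter within that partition
  AND `timestamp` <  '2017-02-03 12:00:00';
```

The string comparisons on `timestamp` assume the values are stored in 'yyyy-MM-dd HH:mm:ss' form, which sorts chronologically.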

54l3d answered Sep 20 '22 17:09