How can I partition a hive table by (only) a portion of a timestamp column?

Question

Assume I have a Hive table that includes a TIMESTAMP column that is frequently (almost always) included in the WHERE clauses of a query. It makes sense to partition this table by the TIMESTAMP field; however, to keep to a reasonable cardinality, it makes sense to partition by day (not by the maximum resolution of the TIMESTAMP).

What's the best way to achieve this? Should I create an additional column (DATE) and partition on that? Or is there a way to achieve the partition without creating a duplicate column?

54l3d · Accepted Answer

Its not a new column, but its a pseudo-column, You should re-create your table with adding the partitioning specification like this :

create table table_name (
  id                int,
  name              string,
  timestamp         string
)
partitioned by (date string)

Then you load the data creating the partitions dynamically like this

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM table_name_old tno
INSERT OVERWRITE TABLE table_name PARTITION(substring(timestamp,0,10))
       SELECT tno.id, tno.name, tno.timestamp;

Now if you select all from your table you will see a new column for the partition, but consider that a Hive partition is just a subdirectory and its not a real column, hence it does not affect the total table size only by some kilobytes.

How can I partition a hive table by (only) a portion of a timestamp column?

Tags:

hadoop

hive

hortonworks-data-platform

Jared

Video Answer

1 Answers

54l3d

Recent Activity

Donate For Us

How can I partition a hive table by (only) a portion of a timestamp column?

Tags:

hadoop

hive

hortonworks-data-platform

Jared

Video Answer

1 Answers

54l3d

Related questions

Recent Activity

Donate For Us