Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I partition a table with HIVE?

I've been playing with Hive for few days now but I still have a hard time with partition.

I've been recording Apache logs (Combine format) in Hadoop for few months. They are stored in row text format, partitioned by date (via flume): /logs/yyyy/mm/dd/hh/*

Example:

/logs/2012/02/10/00/Part01xx (02/10/2012 12:00 am)
/logs/2012/02/10/00/Part02xx
/logs/2012/02/10/13/Part0xxx (02/10/2012 01:00 pm)

The date in the combined log file is following this format [10/Feb/2012:00:00:00 -0800]

How can I create a external table with partition in Hive that use my physical partition. I can't find any good documentation on Hive partition. I found related Question such as:

  • Importing data from HDFS to Hive table
  • partition column in hive

If I load my logs in an external table with Hive, I cannot partition with the time, since it's not the good format (Feb <=> 02). Even if if it was in a good format how do i transform a string "10/02/2012:00:00:00 -0800" into multiple directory "/2012/02/10/00"?

I could eventually use pig script to convert my raw logs into hive tables but at this point I should just be using pig instead of hive to do my reporting.

like image 293
zzarbi Avatar asked Mar 08 '12 23:03

zzarbi


People also ask

Does Hive support partitioning?

Apache Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of particular columns like date, city, and department. Each table in the hive can have one or more partition keys to identify a particular partition.

How do I manually add a partition in Hive?

Syntax: ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...; partition_spec: : (p_column = p_col_value, p_column = p_col_value, ...) The following query is used to add a partition to the employee table.

How partitions on the Hive table can be shown?

The general syntax for showing partitions is as follows: SHOW PARTITIONS [db_name.] table_name [PARTITION(partition_spec)];

What is the partition technique used in Hive?

The partitioning in Hive means dividing the table into some parts based on the values of a particular column like date, course, city or country. The advantage of partitioning is that since the data is stored in slices, the query response time becomes faster.


1 Answers

If I understand correctly, you have files in the folders 4 level deep from the directory logs. In that case, you define your table as external with path 'logs' and partitioned by 4 virtual fields: year, month, day_of_month, hour_of_day.

The partitioning is essentially done for you by Flume.

EDIT 3/9: A lot of details depends on how exactly Flume writes files. But in general terms, your DDL should look something like this:

CREATE TABLE table_name(fields...)
PARTITIONED BY(log_year STRING, log_month STRING, 
    log_day_of_month STRING, log_hour_of_day STRING)
format description
STORED AS TEXTFILE
LOCATION '/your user path/logs';

EDIT 3/15: Per zzarbi request, I'm adding a note that after the table is created, the Hive needs to be informed about partitions created. This needs to be done repeatedly as long as Flume or other process creates new partitions. See my answer to Create external with Partition question.

like image 124
Olaf Avatar answered Oct 01 '22 14:10

Olaf