Partition Hive table by existing field?

Tags: hadoop, database-partitioning, hive, hdfs, partitioning

Can I partition a Hive table upon insert by an existing field?

I have a 10 GB file with a date field and an hour of day field. Can I load this file into a table, then insert-overwrite into another partitioned table that uses those fields as a partition? Would something like the following work?

INSERT OVERWRITE TABLE tealeaf_event  PARTITION(dt=evt.datestring,hour=evt.hour) 
SELECT * FROM staging_event evt;

Thanks!

Travis


asked Jul 08 '11 23:07

batman

2 Answers

I ran across this while trying to answer the same question, and it was helpful but not quite complete. The short answer is yes: something like the query in the question will work, but the syntax is not quite right.

Say you have three tables which were created using the following statements:

CREATE TABLE staging_unpartitioned (datestring string, hour int, a int, b int);

CREATE TABLE staging_partitioned (a int, b int) 
    PARTITIONED BY (datestring string, hour int);

CREATE TABLE production_partitioned (a int, b int) 
    PARTITIONED BY (dt string, hour int);

Columns a and b are just example columns. dt and hour are the values we want to partition on once the data reaches the production table. The statement that moves staging data to production looks exactly the same whether the source is staging_unpartitioned or staging_partitioned:

INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT a, b, datestring, hour FROM staging_unpartitioned;

INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT a, b, datestring, hour FROM staging_partitioned;

This uses a feature called dynamic partitioning, which is covered in the Hive documentation. The important thing to note is that the SELECT order, not the column names, determines which columns feed which partitions. The dynamic partition columns must be selected last, in the same order in which they appear in the PARTITION clause.
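To make the positional matching concrete, here is a minimal sketch that assumes the three example tables above and that both settings discussed further down are already enabled. The alias dt is only there for readability; Hive does not use it to pick the partition.

```sql
-- Sketch only: assumes the example tables defined above.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The last two SELECT expressions feed PARTITION (dt, hour), in that order.
-- Matching is by position, so the alias "dt" is cosmetic.
INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT a, b, datestring AS dt, hour FROM staging_unpartitioned;

-- Caution: writing "SELECT a, b, hour, datestring" instead would feed
-- the hour values into dt and the datestring values into hour.

-- Check which partitions were created.
SHOW PARTITIONS production_partitioned;
```

SHOW PARTITIONS is a quick way to confirm that the values landed in the partition columns you intended.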

There's a good chance that the statements above will fail because of the properties you currently have set. First, they will not work if dynamic partitioning is disabled, so make sure to enable it:

set hive.exec.dynamic.partition=true;

You might then hit an error if you do not specify at least one static partition before the dynamic partitions. This restriction is meant to keep you from accidentally removing a root partition when you only meant to overwrite its sub-partitions with dynamic partitions. In my experience this behavior has never been helpful and has often been annoying, but your mileage may vary. At any rate, it is easy to change:

set hive.exec.dynamic.partition.mode=nonstrict;

And that should do it.
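If you would rather leave strict mode on, the alternative is to give the leading partition a static value and let only the trailing one be dynamic. A rough sketch, assuming the example tables above; the date '2011-07-08' is just a made-up value for illustration:

```sql
-- Sketch only: one static partition (dt) followed by one dynamic partition (hour).
-- This form satisfies the strict-mode requirement of at least one static partition.
SET hive.exec.dynamic.partition=true;

INSERT OVERWRITE TABLE production_partitioned PARTITION (dt='2011-07-08', hour)
    -- dt is fixed in the PARTITION clause, so only hour is selected for partitioning.
    SELECT a, b, hour
    FROM staging_unpartitioned
    WHERE datestring = '2011-07-08';  -- hypothetical date value
```

The trade-off is that you need one statement per dt value, which is why nonstrict mode is usually more convenient for a one-off bulk load like the one in the question.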

answered Oct 06 '22 23:10

Daniel Koverman

Maybe this is already answered... but yes, you can do exactly as you have stated. I have done it many times. Obviously your new table would need to be defined similarly to the original one, but without the partition column in the regular column list and with the partition specification instead. Also, I cannot remember whether I had to explicitly list the columns of the original table, or whether the asterisk was sufficient.
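On the asterisk: because partition columns are matched by position, SELECT * should only be relied on when the source table's column order already ends with the partition columns. A sketch using the example tables from the other answer (an assumption, since the question does not show the real schemas):

```sql
-- Sketch only: uses the example tables from the other answer.

-- staging_partitioned is itself partitioned, so SELECT * returns
-- a, b, datestring, hour (partition columns come last). The order matches.
INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT * FROM staging_partitioned;

-- staging_unpartitioned is declared as (datestring, hour, a, b), so SELECT *
-- would put the columns in the wrong positions. List them explicitly instead.
INSERT OVERWRITE TABLE production_partitioned PARTITION (dt, hour)
    SELECT a, b, datestring, hour FROM staging_unpartitioned;
```

Listing the columns explicitly is the safer habit, since it keeps working if the staging table's layout changes.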

answered Oct 07 '22 01:10

Wanderer
