Hive : Insert overwrite multiple partitions

Tags: hadoop, hive

I have a Hive table partitioned on date. I want to be able to selectively overwrite the partitions for the last 'n' days (or a custom list of partitions).

Is there a way to do it without writing an "INSERT OVERWRITE DIRECTORY" statement for each partition?

Any help is greatly appreciated.

asked Sep 06 '13 22:09 by rahul



1 Answer

Hive supports dynamic partitioning, so you can build a query where the partition is just one of the source fields.

INSERT OVERWRITE TABLE dst PARTITION (dt)
SELECT col0, col1, ... coln, dt FROM src WHERE ...

The where clause can specify which values of dt you want to overwrite.
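As a rough sketch of the two cases from the question, the filter might look like the following. It assumes dt is stored as a 'yyyy-MM-dd' string (the question does not say), that current_date is available (Hive 1.2 or later), and it uses col0, col1, the 7-day window and the listed dates purely as placeholders.

```sql
-- Overwrite the partitions from 7 days ago up to today.
-- Assumes dt is a 'yyyy-MM-dd' string; the cast keeps the comparison
-- string-to-string regardless of what date_sub returns.
INSERT OVERWRITE TABLE dst PARTITION (dt)
SELECT col0, col1, dt
FROM src
WHERE dt >= CAST(date_sub(current_date, 7) AS STRING);

-- Overwrite only a custom list of partitions (placeholder dates).
INSERT OVERWRITE TABLE dst PARTITION (dt)
SELECT col0, col1, dt
FROM src
WHERE dt IN ('2013-09-01', '2013-09-03');
```

Note that only partitions for which the SELECT actually returns rows are overwritten. A dt value that matches the filter but has no rows in src leaves the existing partition in dst untouched.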

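Just make sure the partition field (dt in this case) comes last in the SELECT list, because Hive matches dynamic partition columns by position. You can even use SELECT *, dt if the dt field is already part of the source, or SELECT *, my_udf(dt) AS dt, as long as the resulting columns line up with dst's columns followed by the partition column.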

By default, Hive requires at least one of the specified partition columns to be static (strict mode). You can relax this by switching to nonstrict mode, so for the query above, set the following before running it:

set hive.exec.dynamic.partition.mode=nonstrict;
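To illustrate the difference between the two modes, here is a minimal sketch. The table dst_by_country and its country partition column are hypothetical and only serve to show a static value next to a dynamic one. Whether hive.exec.dynamic.partition needs to be enabled explicitly depends on the Hive version, so treat that line as an assumption about your setup.

```sql
-- Enable dynamic partitioning (already the default in many Hive versions).
SET hive.exec.dynamic.partition=true;

-- Allowed in strict mode: country is static, dt is dynamic.
-- Static partition columns must come before dynamic ones.
-- dst_by_country is a hypothetical table partitioned by (country, dt).
INSERT OVERWRITE TABLE dst_by_country PARTITION (country='US', dt)
SELECT col0, col1, dt
FROM src
WHERE country = 'US';

-- Requires nonstrict mode: every partition column is dynamic.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE dst PARTITION (dt)
SELECT col0, col1, dt
FROM src
WHERE dt IN ('2013-09-01', '2013-09-03');
```

If a single statement creates a large number of partitions, the limits hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode may also need raising.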
answered Sep 22 '22 18:09 by libjack