Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

HIVE - INSERT OVERWRITE vs DROP TABLE + CREATE TABLE + INSERT INTO

I'm doing some automatic script of few queries in hive and we found that we need time to time clear the data from a table and insert the new one. And we are thinking what could be faster?

INSERT OVERWRITE TABLE SOME_TABLE
    SELECT * FROM OTHER_TABLE;

or is faster to do like this:

DROP TABLE SOME_TABLE;
CREATE TABLE SOME_TABLE (STUFFS);
INSERT INTO TABLE
    SELECT * FROM OTHER_TABLE;

The overhead of running the queries is not an issue. Due to we have the script o creation too. The question is, the INSERT OVERWRITE with billion of rows is faster than DROP + CREATE + INSERT INTO?

like image 831
Thiago Baldim Avatar asked Sep 21 '16 13:09

Thiago Baldim


People also ask

What is the difference between insert into and insert overwrite?

Conclusion. In summary the difference between Hive INSERT INTO vs INSERT OVERWRITE, INSERT INTO is used to append the data into Hive tables and partitioned tables and INSERT OVERWRITE is used to remove the existing data from the table and insert the new data.

What is insert overwrite table in Hive?

Description. The INSERT OVERWRITE DIRECTORY with Hive format overwrites the existing data in the directory with the new values using Hive SerDe . Hive support must be enabled to use this command. The inserted rows can be specified by value expressions or result from a query.

What is insert overwrite table?

Description. The INSERT OVERWRITE statement overwrites the existing data in the table using the new values. The inserted rows can be specified by value expressions or result from a query.

What happens when a table is dropped in Hive?

When you drop a table from Hive Metastore, it removes the table/column data and their metadata. It can be a normal table (stored in Metastore) or an external table (stored in local file system); Hive treats both in the same manner, irrespective of their types.


Video Answer


2 Answers

One edge consideration is that if your schema changes, INSERT OVERWRITE will fail, while DROP+CREATE+INSERT will not. While this is unlikely to apply in most scenarios, if you're prototyping workflow/table schemas then it might be worth considering.

like image 125
Brendan Avatar answered Sep 17 '22 16:09

Brendan


For maximum speed I would suggest to 1) issue hadoop fs -rm -r -skipTrash table_dir/* first to remove old data fast without putting files into trash because INSERT OVERWRITE will put all files into Trash and for very big table this will take a lot of time. Then 2) do INSERT OVERWRITE command. This will be faster also because you do not need to drop/create table.

UPDATE:

As of Hive 2.3.0 (HIVE-15880), if the table has TBLPROPERTIES ("auto.purge"="true") the previous data of the table is not moved to Trash when INSERT OVERWRITE query is run against the table. This functionality is applicable only for managed tables. So, INSERT OVERWRITE with auto purge will work faster than rm -skipTrash + INSERT OVERWRITE or DROP+CREATE+INSERT because it will be a single Hive-only command.

like image 28
leftjoin Avatar answered Sep 19 '22 16:09

leftjoin