HIVE - INSERT OVERWRITE vs DROP TABLE + CREATE TABLE + INSERT INTO

Tags: create-table, hive, hiveql, hiveddl

I'm automating a script that runs a few queries in Hive, and we found that from time to time we need to clear the data from a table and insert new data. We are wondering which approach is faster:

INSERT OVERWRITE TABLE SOME_TABLE
    SELECT * FROM OTHER_TABLE;

Or is it faster to do it like this:

DROP TABLE SOME_TABLE;
CREATE TABLE SOME_TABLE (STUFFS);
INSERT INTO TABLE SOME_TABLE
    SELECT * FROM OTHER_TABLE;

The overhead of running the extra queries is not an issue, since we already have the creation script too. The question is: with billions of rows, is INSERT OVERWRITE faster than DROP + CREATE + INSERT INTO?

asked Sep 21 '16 13:09

Thiago Baldim

2 Answers

One edge case to consider is that if your schema changes, INSERT OVERWRITE will fail, while DROP+CREATE+INSERT will not. This is unlikely to apply in most scenarios, but if you're prototyping workflows or table schemas, it might be worth considering.
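
To illustrate the schema-change case, here is a minimal HiveQL sketch. The column names (id, name, created_at) are hypothetical and not from the question; it assumes OTHER_TABLE has gained a created_at column that the existing SOME_TABLE does not have.

```sql
-- Hypothetical columns for illustration only. Assumes OTHER_TABLE now has
-- (id, name, created_at) while the existing SOME_TABLE still has (id, name).

-- Expected to be rejected: SELECT * yields three columns, the target has two.
-- INSERT OVERWRITE TABLE SOME_TABLE SELECT * FROM OTHER_TABLE;

-- Re-creating the table lets its definition match the new source schema.
DROP TABLE IF EXISTS SOME_TABLE;
CREATE TABLE SOME_TABLE (
    id         INT,
    name       STRING,
    created_at TIMESTAMP
);
INSERT INTO TABLE SOME_TABLE
    SELECT * FROM OTHER_TABLE;
```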

answered Sep 17 '22 16:09

Brendan

For maximum speed, I would suggest two steps: 1) first run hadoop fs -rm -r -skipTrash table_dir/* to remove the old data quickly without moving the files to Trash, because INSERT OVERWRITE moves all of the old files to Trash, which takes a long time for a very big table; then 2) run the INSERT OVERWRITE command. This is also faster because you do not need to drop and re-create the table.
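
As a rough sketch, both steps can be run from the Hive CLI, which passes HDFS shell commands through its dfs command. The warehouse path below is an assumed placeholder, not taken from the question; substitute the table's real location (DESCRIBE FORMATTED SOME_TABLE shows it).

```sql
-- Step 1: delete the old files directly, bypassing Trash.
-- /user/hive/warehouse/some_table is a hypothetical path.
dfs -rm -r -skipTrash /user/hive/warehouse/some_table/*;

-- Step 2: load the new data; no old files remain to be moved to Trash.
INSERT OVERWRITE TABLE SOME_TABLE
    SELECT * FROM OTHER_TABLE;
```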

UPDATE:

As of Hive 2.3.0 (HIVE-15880), if the table has TBLPROPERTIES ("auto.purge"="true"), the previous data of the table is not moved to Trash when an INSERT OVERWRITE query is run against it. This applies only to managed tables. So INSERT OVERWRITE with auto purge will be faster than rm -skipTrash + INSERT OVERWRITE or DROP+CREATE+INSERT, because it is a single Hive-only command.
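
A minimal sketch of the auto-purge approach, assuming SOME_TABLE is a managed table that already exists:

```sql
-- One-time setup: on overwrite, old data is deleted instead of moved to Trash
-- (managed tables only; Hive 2.3.0 or later).
ALTER TABLE SOME_TABLE SET TBLPROPERTIES ("auto.purge"="true");

-- Each refresh is then a single statement.
INSERT OVERWRITE TABLE SOME_TABLE
    SELECT * FROM OTHER_TABLE;
```

Because purged data skips Trash, it cannot be restored from there, so this trades recoverability for speed.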

answered Sep 19 '22 16:09

leftjoin
