Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive insert vs Hive Load: What are the trade offs?

I'm learning Hadoop/Big data technologies. I would like to ingest data in bulk into hive. I started working with a simple CSV file and when I tried to use INSERT command to load each record by record, one record insertion itself took around 1 minute. When I put the file into HDFS and then used the LOAD command, it was instantaneous since it just copies the file into hive's warehouse. I just want to know what are the trade offs that one have to face when they opt in towards LOAD instead of INSERT.

like image 367
Sivaprasanna Sethuraman Avatar asked Jan 26 '17 16:01

Sivaprasanna Sethuraman


1 Answers

Load- Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.

Insert-Query Results can be inserted into tables by using the insert clause and which in turn runs the map reduce jobs.So it takes some time to execute.

In case if you want to optimize/tune the insert statements.Below are some techniques:

1.Setting the execution Engine in hive-site.xml to Tez(if its already installed)

 set hive.execution.engine=tez;

2.USE ORCFILE

CREATE TABLE A_ORC (

customerID int, name string, age int, address string

) STORED AS ORC tblproperties (“orc.compress" = “SNAPPY”);

INSERT INTO TABLE A_ORC SELECT * FROM A;

3. Concurrent job runs in hive can save the overall job running time .To achieve that hive-default.xml,below config needs to be changed:

set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=<your value>;

For more info,you can visit http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ Hope this helps.

like image 191
Subash Avatar answered Sep 22 '22 05:09

Subash