I'm learning Hadoop/Big data technologies. I would like to ingest data in bulk into hive. I started working with a simple CSV file and when I tried to use INSERT
command to load each record by record, one record insertion itself took around 1 minute. When I put the file into HDFS and then used the LOAD
command, it was instantaneous since it just copies the file into hive's warehouse. I just want to know what are the trade offs that one have to face when they opt in towards LOAD
instead of INSERT
.
Load- Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.
Insert-Query Results can be inserted into tables by using the insert clause and which in turn runs the map reduce jobs.So it takes some time to execute.
In case if you want to optimize/tune the insert statements.Below are some techniques:
1.Setting the execution Engine in hive-site.xml to Tez(if its already installed)
set hive.execution.engine=tez;
2.USE ORCFILE
CREATE TABLE A_ORC (
customerID int, name string, age int, address string
) STORED AS ORC tblproperties (“orc.compress" = “SNAPPY”);
INSERT INTO TABLE A_ORC SELECT * FROM A;
3. Concurrent job runs in hive can save the overall job running time .To achieve that hive-default.xml,below config needs to be changed:
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=<your value>;
For more info,you can visit http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ Hope this helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With