 

Merge delta data into an external table using Hive's MERGE statement

I have an external table mapped in Hive (v2.3.2 on EMR-5.11.0) that I need to update with new data around once a week. The merge consists of a conditional upsert statement.
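For concreteness, here is a minimal sketch of the kind of conditional upsert I mean (the table and column names are placeholders, not my real schema):

    MERGE INTO target AS t
    USING delta AS d
    ON t.id = d.id
    WHEN MATCHED AND t.value <> d.value THEN
      UPDATE SET value = d.value
    WHEN NOT MATCHED THEN
      INSERT VALUES (d.id, d.value);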

The table's location is in S3, and the data is always there (it was created once, and we just need to update it with new data).

I've read this blog post about merging data in Hive using the ACID feature on transactional tables (https://dzone.com/articles/update-hive-tables-the-easy-way-part-2-hortonworks), but as far as I can see, the only solution is to copy my external table into a temporary internal Hive table that is clustered and transactional, run the merge only on that table, and then overwrite my original data with the merged result.

This table is quite big (around 10GB of data), so I'd like to avoid copying it before every merge operation.

Is there a way to create the internal table and have it mapped to the existing data? Or is there another way, besides the MERGE statement, to perform an upsert on Hive external tables?

Thanks a lot in advance!

Asked Jan 02 '18 by Meori Lehr


1 Answer

If you want to use MERGE for an upsert, the data has to be in a managed (internal) Hive table. A simple solution, if you don't want to copy the files yourself:

  1. Create an internal table with a similar structure:

    CREATE TABLE table1 (
        id INT,
        Name STRING,
        UpdatedDate DATE
    )
    CLUSTERED BY (id) INTO 2 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ("transactional"="true");

  2. Insert the data from the external table into the internal table:

    INSERT INTO TABLE table1
    SELECT * FROM your_external_table;

  3. Run your MERGE statement to update the internal table (DailyFeed stands for a table holding the new delta; a sketch of one way to map it follows this list):

    SET hive.support.concurrency = true;
    SET hive.enforce.bucketing = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
    SET hive.compactor.initiator.on = true;
    SET hive.compactor.worker.threads = 1;

    MERGE INTO table1
    USING DailyFeed
    ON DailyFeed.id = table1.id
    WHEN MATCHED AND table1.name <> DailyFeed.name THEN
      UPDATE SET name = DailyFeed.name
    WHEN NOT MATCHED THEN
      INSERT VALUES (DailyFeed.id, DailyFeed.name, CURRENT_DATE);

  4. Export the internal table to overwrite your base file (a sketch that writes straight back to S3 instead follows this list):

    hive -e 'set hive.cli.print.header=true; select * from table1' | sed 's/[[:space:]]\+/,/g' > /home/user1/table1.csv

Or:

    INSERT OVERWRITE LOCAL DIRECTORY '/home/user1/table1.csv'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT * FROM table1;
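
Two hedged sketches to round this out. The MERGE above reads from DailyFeed; assuming the weekly delta lands in S3, one way to map it is as an external table over that location (the bucket path and columns are placeholders):

    CREATE EXTERNAL TABLE DailyFeed (
        id INT,
        Name STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://your-bucket/daily-feed/';

And since your original data already lives in S3 at the external table's location, an alternative to exporting a local CSV is to write the merged result straight back over the external table (assuming it is named your_external_table and its columns line up with table1):

    INSERT OVERWRITE TABLE your_external_table
    SELECT * FROM table1;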

Hope this helps resolve your issue.

Thanks

Nilesh

Answered Nov 02 '22 by Niel