I have an external table mapped in Hive (v2.3.2 on EMR-5.11.0) that I need to update with new data around once a week. The update is a conditional upsert, performed with a merge statement.
The table's location is in S3, and the data is always there (it was created once, and we just need to update it with new data).
I've read this blog post about merging data in Hive using the ACID feature on transactional tables (https://dzone.com/articles/update-hive-tables-the-easy-way-part-2-hortonworks), but as far as I can see, the only solution is to copy my external table into a temporary internal Hive table that is clustered and transactional, run the merge on that table only, and then override my original data with the merged result.
This table is quite big (around 10GB of data), so I'd like to avoid copying it before every merge operation.
Is there a way to create the internal table and have it mapped to the existing data? Or is there another way, besides the merge statement, to perform an upsert on Hive external tables?
Thanks a lot in advance!
If you want to use MERGE for an upsert, the data has to be in a managed (internal) Hive table. If you don't want to keep a separate copy of the file, a simple approach is:
Create an internal table with the same structure:
-- transactional tables in Hive 2.x must be bucketed and stored as ORC
CREATE TABLE table1 (
  id int,
  Name string,
  UpdatedDate date
)
CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional"="true");
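As a quick sanity check (my addition, not part of the original steps), you can confirm the transactional property took effect before loading any data:

-- "transactional" should appear as true under Table Parameters
DESCRIBE FORMATTED table1;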
Insert the data from the external table into the internal table:
INSERT INTO TABLE table1 SELECT * FROM your_external_table;
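To make sure the copy completed, a simple row-count comparison works (your_external_table is the same placeholder as above):

SELECT COUNT(*) FROM your_external_table;
SELECT COUNT(*) FROM table1;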
Run your merge statement to update the internal table:
-- settings required for ACID / MERGE support
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;

MERGE INTO table1
USING DailyFeed
ON DailyFeed.id = table1.id
WHEN MATCHED AND table1.name <> DailyFeed.name
  THEN UPDATE SET name = DailyFeed.name
WHEN NOT MATCHED
  THEN INSERT VALUES (DailyFeed.id, DailyFeed.name, CURRENT_DATE);
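The merge assumes a source table named DailyFeed that holds the week's new records. If you don't already have one, a minimal sketch of such a staging table might look like this (the S3 path, delimiter, and file format are illustrative assumptions, not from the original answer):

-- hypothetical external staging table over the weekly feed files
CREATE EXTERNAL TABLE DailyFeed (
  id int,
  name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://your-bucket/daily-feed/';  -- illustrative path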
Overwrite your base file by exporting the internal table:
# note: this collapses any run of whitespace (including spaces inside field values) into a comma
hive -e 'set hive.cli.print.header=true; select * from table1' | sed -E 's/[[:space:]]+/,/g' > /home/user1/table1.csv
Or
-- note: this writes a directory of output files, not a single CSV file
INSERT OVERWRITE LOCAL DIRECTORY '/home/user1/table1.csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM table1;
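To actually overwrite the external table's data, you still need to push the exported file(s) back to its S3 location. One way to do that (the bucket path is an assumption; substitute your external table's LOCATION):

# the local path is a directory when produced by INSERT OVERWRITE LOCAL DIRECTORY
aws s3 sync /home/user1/table1.csv/ s3://your-bucket/path/to/external-table/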
Hope this helps resolve your issue.
Thanks,
Nilesh