 

Hive: Best way to do incremental updates on a main table

So I have a main table in Hive that will store all my data.

I want to be able to load an incremental data update about every month, with a large amount of data (a couple billion rows). There will be new data as well as updated entries.

What is the best way to approach this? I know Hive was recently upgraded and now supports update/insert/delete.

What I've been thinking is to somehow find the entries that will be updated, remove them from the main table, and then just insert the new incremental update. However, after trying this, the inserts are very fast, but the deletes are very slow.
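For reference, a rough sketch of that delete-then-insert approach, assuming the target is an ACID (transactional) table; main_table, increment_data and pk are placeholder names, not the real ones:

DELETE FROM main_table
WHERE pk IN (SELECT pk FROM increment_data);  -- this is the part that turns out to be slow

INSERT INTO TABLE main_table
SELECT * FROM increment_data;                 -- plain append, fast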

The other way is to do something with the UPDATE statement, matching key values between the main table and the incremental update and updating their fields. I haven't tried this yet. It also sounds painfully slow, since Hive would have to update each entry one by one.

Anyone got any ideas as to how to do this most efficiently and effectively? I'm pretty new to Hive and databases in general.

Asked Jun 08 '16 17:06 by Alex

People also ask

How do I handle incremental data in a Hive table?

If your Hive version is 2.2 or above, you can use the MERGE statement to perform an incremental load. The MERGE statement will first check whether the row is already present in the Hive table; if it is, the row is updated, otherwise a new record is inserted.
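For illustration, a minimal sketch of such a MERGE (Hive 2.2+, ACID/transactional target table); the table and column names main_table, increment_data, pk, col1 and col2 are placeholders, not taken from the question:

merge into main_table t
using increment_data i
on t.pk = i.pk
when matched then update set col1 = i.col1, col2 = i.col2
when not matched then insert values (i.pk, i.col1, i.col2);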

What is incremental load in Hive?

In Hive, incremental load is generally used to implement slowly changing dimensions. When you migrate your data to Hadoop Hive, you usually keep syncing the slowly changing tables with the latest data.

How do I load incremental data in Hive using Sqoop?

We can use the Sqoop incremental import command with the "--merge-key" option to update records in an already imported Hive table. --incremental lastmodified will import the updated and new records from the RDBMS (MySQL) database based on the latest value of emp_timestamp in Hive.
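Roughly, such a Sqoop command might look like the following; the JDBC URL, credentials, table, column and path names here are placeholders:

sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser -P \
  --table emp \
  --incremental lastmodified \
  --check-column emp_timestamp \
  --last-value '2016-06-01 00:00:00' \
  --merge-key emp_id \
  --target-dir /user/hive/warehouse/emp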

Does Hive support row level updates?

Classic (non-ACID) Hive has the following limitation: it does not allow row-level inserts, updates, or deletes. It only provides the option of dropping a table (or partition) when it comes to deleting something, because behind the scenes Hive works with files on HDFS.
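On current Hive versions, row-level update/delete/merge do work, but only on transactional (ACID) ORC tables, created roughly like this sketch; table and column names are placeholders, and the bucketing clause is only mandatory on older Hive versions:

create table main_table (
  pk bigint,
  col1 string
)
clustered by (pk) into 16 buckets
stored as orc
tblproperties ('transactional'='true');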


1 Answer

If MERGE in ACID mode is not applicable, then it's possible to update using a FULL OUTER JOIN or using UNION ALL + row_number. To find all entries that will be updated, you can join the increment data with the old data:

insert overwrite table target_data [partition() if applicable]
SELECT
  --take the new value if it exists in the increment, otherwise keep the old one
  case when i.PK is not null then i.PK   else t.PK   end as PK,
  case when i.PK is not null then i.COL1 else t.COL1 end as COL1,
  ...
  case when i.PK is not null then i.COL_n else t.COL_n end as COL_n
  FROM
      target_data t --restrict partitions if applicable
      FULL JOIN increment_data i on (t.PK=i.PK);

It's possible to optimize this by restricting the partitions in target_data that will be overwritten and joined, either with WHERE partition_col in (select distinct partition_col from increment_data), or, if possible, by passing the partition list as a parameter and using it in the WHERE clause; that will work even faster.
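A rough sketch of that partition-restricted variant, reusing the naming from the query above (target_data, increment_data, PK, COL1) and adding a hypothetical partition column partition_col; the column list is abbreviated:

set hive.exec.dynamic.partition.mode=nonstrict; --needed for the dynamic-partition overwrite
insert overwrite table target_data partition (partition_col)
SELECT
  case when i.PK is not null then i.PK   else t.PK   end as PK,
  case when i.PK is not null then i.COL1 else t.COL1 end as COL1,
  case when i.PK is not null then i.partition_col else t.partition_col end as partition_col
  FROM
      (select * from target_data
       where partition_col in (select distinct partition_col from increment_data)) t
      FULL JOIN increment_data i on (t.PK=i.PK);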

Also, if you want to update all columns with new data, you can apply this solution with UNION ALL + row_number(), which works faster than a full join: https://stackoverflow.com/a/44755825/2700344
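For completeness, a rough sketch of that UNION ALL + row_number() variant: rows from the increment win over old rows with the same PK. Names follow the query above; the column list is abbreviated and source_flag is a helper column added here for illustration:

insert overwrite table target_data
select PK, COL1, COL_n
from (
  select PK, COL1, COL_n,
         row_number() over (partition by PK order by source_flag desc) as rn
  from (
    select PK, COL1, COL_n, 1 as source_flag from increment_data
    union all
    select PK, COL1, COL_n, 0 as source_flag from target_data
  ) s
) ranked
where rn = 1;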

Answered Oct 20 '22 20:10 by leftjoin