So I have a main table in Hive, it will store all my data.
I want to be able to load a incremental data update about every month with a large amount of data couple billion rows. There will be new data as well as updated entries.
What is the best way to approach this, I know Hive recently upgrade and supports update/insert/delete.
What I've been thinking is to somehow find the entries that will be updated and remove them from the main table and then just insert the new incremental update. However after trying this, the inserts are very fast, but the deletes are very slow.
The other way is to do something using the update statement to match the key values from the main table and the incremental update and update their fields. I haven't tried this yet. This also sounds painfully slow since Hive would have to update each entry 1 by 1.
Anyone got any ideas as to how to do this most efficiently and effectively ?? I'm pretty new to Hive and databases in general.
If your Hive version is anything but above version 2.2, you can use MERGE statement to perform incremental load. The MERGE statement will first check if row is available in Hive table. It will be updated if it is available, otherwise new record will be inserted.
In the hive, Incremental load is generally used to implement slowly changing dimensions. When you migrate your data to the Hadoop Hive, you might usually keep the slowly changing tables to sync up tables with the latest data.
We can use Sqoop incremental import command with “-merge-key” option for updating the records in an already imported Hive table. --incremental lastmodified will import the updated and new records from RDBMS (MySQL) database based on last latest value of emp_timestamp in Hive.
Hive has following limitations: 1) It does not allow row level inserts,updates or deletes.It only provides the option of dropping of table when it comes to deleting something because behind the scenes hive works with files on HDFS.
If merge in ACID mode is not applicable, then it's possible to update using FULL OUTER JOIN or using UNION ALL + row_number. To find all entries that will be updated you can join increment data with old data:
insert overwrite target_data [partition() if applicable]
SELECT
--select new if exists, old if not exists
case when i.PK is not null then i.PK else t.PK end as PK,
case when i.PK is not null then i.COL1 else t.COL1 end as COL1,
...
case when i.PK is not null then i.COL_n else t.COL_n end as COL_n
FROM
target_data t --restrict partitions if applicable
FULL JOIN increment_data i on (t.PK=i.PK);
It's possible to optimize this by restricting partitions in target_data that will be overwritten and joined using WHERE partition_col in (select distinct partition_col from increment_data)
or pass partition list if possible as a parameter and use in the where clause, it will work even faster.
Also if you want to update all columns with new data, you can apply this solution with UNION ALL+row_number()
, it works faster than full join: https://stackoverflow.com/a/44755825/2700344
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With