Hibernate Mass Insert/update : is this a good approach?

I am currently doing performance and memory tuning of our Hibernate-based app for large bulk/batch imports. We are basically importing a CSV file with product data, where some products are new (insert) and some already exist (update).

My focus now is on choosing a strategy to find out which entities to UPDATE and which ones to INSERT, without doing a check (SELECT if exists) for each row in the CSV file.

My current approach is like this:

  1. Build a hashmap of all the objects inside the database.
  2. Iterate over the CSV and use the hashmap to decide whether to update or insert.

This approach works well and tests have proved that it is orders of magnitude faster than doing such a single IF EXISTS check for every row.
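A minimal sketch of that first approach, assuming a Product entity with a unique sku column and a plain Hibernate Session (the typed createQuery of newer Hibernate versions); CsvRow, applyCsvData and mapToNewProduct are made-up helpers:

    // Load every product once and index it by SKU (entity and helper names are illustrative).
    Map<String, Product> existingBySku = new HashMap<>();
    for (Product p : session.createQuery("from Product", Product.class).list()) {
        existingBySku.put(p.getSku(), p);
    }

    for (CsvRow row : csvRows) {
        Product existing = existingBySku.get(row.getSku());
        if (existing != null) {
            existing.applyCsvData(row);          // managed entity -> UPDATE via dirty checking on flush
        } else {
            session.save(mapToNewProduct(row));  // unknown SKU -> INSERT
        }
    }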

My concern is the memory footprint if there are lots of entities in the DB.

Right now I am thinking about using a slight variation of the approach above, and I would like to hear opinions on it. Basically, I want to do the IF EXISTS checks in batches of multiple rows (e.g. SELECT FROM table WHERE sku IN (sku1, sku2, sku3)).

Here is some pseudocode:

1. Database contains: db {sku1, sku2, sku3, sku5}

2. File contains: file {sku1, sku2, sku3, sku6}

3. Expected result:
   updates: {sku1, sku2, sku3}
   inserts: {sku6}

4. Algorithm

   Have a map to keep the database entities which need updates:
   updatemap {}
   Now iterate over the file in batches of e.g. 2 rows (for demo purposes).
   1st iteration: foreach (select where sku IN (sku1, sku2) limit 2) as elem
    -> updatemap.add(elem)  -> elem is assumed to be a persistent entity here
    -> myDAO.update(elem)   -> executes Spring's getHibernateTemplate().update() under the hood

   -> updatemap contents after 1st loop {sku1, sku2}

   2nd iteration: foreach (select where sku IN (sku3, sku6) limit 2) as elem
    -> updatemap.add(elem)    
    -> myDAO.update(elem)

   -> updatemap contents after 2nd loop {sku1, sku2, sku3}

Btw: I also already take care of the usual batching measures, e.g. if (i % 30 == 0) { session.flush(); session.clear(); }
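In context, that pattern looks roughly like this (a sketch only; the batch size of 30 and the mapToProduct helper are just for illustration):

    int i = 0;
    for (CsvRow row : csvRows) {
        session.saveOrUpdate(mapToProduct(row));   // mapToProduct is an illustrative helper
        if (++i % 30 == 0) {
            session.flush();   // push the pending statements of this batch to the DB
            session.clear();   // detach entities so the first-level cache does not keep growing
        }
    }
    session.flush();           // flush the final partial batch
    session.clear();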

Now we know all the elements that were updated. All SKUs not in updatemap are inserts, and we can use simple set arithmetic to determine those:

file {sku1, sku2, sku3, sku6} - updatemap {sku1, sku2, sku3} = newinserts {sku6}

Now we can go ahead and do inserts for the remaining CSV rows.
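Putting the chunked variant together, here is a rough sketch of what I have in mind (Product, CsvRow, the helpers and the chunk size of 500 are all illustrative; Lists.partition is Guava's, but any chunking works):

    Set<String> updatedSkus = new HashSet<>();

    for (List<CsvRow> chunk : Lists.partition(csvRows, 500)) {
        Map<String, CsvRow> rowBySku = new HashMap<>();
        for (CsvRow row : chunk) {
            rowBySku.put(row.getSku(), row);
        }

        // One SELECT ... WHERE sku IN (...) per chunk instead of one SELECT per CSV row.
        List<Product> existing = session
            .createQuery("from Product p where p.sku in (:skus)", Product.class)
            .setParameterList("skus", rowBySku.keySet())
            .list();

        for (Product p : existing) {
            p.applyCsvData(rowBySku.get(p.getSku()));  // dirty checking turns this into an UPDATE on flush
            updatedSkus.add(p.getSku());
        }

        session.flush();   // write the updates of this chunk
        session.clear();   // keep the session/first-level cache small
    }

    // Set arithmetic: every SKU from the file that was never selected must be an insert.
    for (CsvRow row : csvRows) {
        if (!updatedSkus.contains(row.getSku())) {
            session.save(mapToNewProduct(row));
        }
    }
    session.flush();
    session.clear();

Note that updatedSkus only holds SKU strings, not entities, so it stays small even for large files.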

Conclusion: My assumption is that, because of the chunking of the file contents, I can limit the amount of memory used. I issue more SELECT statements than in my initial approach, but I have more control over memory usage in case there are already thousands of entities in the DB.

What are your ideas on this? What other efficient approaches exist to find out which entities to update and which to insert in bulk?

asked Nov 03 '22 by Christoph

1 Answer

I had the exact same problem, involving millions of records, and solved it pretty much exactly as you did. A constraint that may not be obvious to an outside observer is that we cannot use the regular Hibernate way of load-mutate-update, since that would create an inordinate amount of redundant traffic.

On closer reading, my approach differs from yours in that I don't retain any information beyond the processing of a single chunk. I process the chunk in full, including all inserts and updates, before proceeding to the next one. Only that way do you get a scalable solution.

The weakest point for me is the use of executeUpdate, which does not use the JDBC batching API. I was planning to write a custom implementation, but for my specific use case it turned out I didn't need more than one executeUpdate per chunk.
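For completeness, the JDBC batching API mentioned here looks roughly like this in plain JDBC (a sketch only; the table, columns and CsvRow accessors are made up, and connection is a java.sql.Connection):

    // Batched UPDATE for one chunk: statements are queued and sent together.
    String sql = "update product set price = ? where sku = ?";
    try (PreparedStatement ps = connection.prepareStatement(sql)) {
        for (CsvRow row : chunk) {
            ps.setBigDecimal(1, row.getPrice());
            ps.setString(2, row.getSku());
            ps.addBatch();        // queue instead of executing immediately
        }
        ps.executeBatch();        // execute the whole batch at once
    }

Hibernate can also batch its own flushed INSERT/UPDATE statements when hibernate.jdbc.batch_size is configured; the point above is that a bulk executeUpdate does not go through that mechanism.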

answered Nov 09 '22 by Marko Topolnik