
Hibernate batch-delete vs single delete

EDIT: Based on some of my debugging and logging, I think the question boils down to this: why is DELETE FROM table WHERE id = x so much faster than DELETE FROM table WHERE id IN (x), where x is just a single ID?

I recently tested batch-delete versus deleting each row one by one and noticed that batch-delete was much slower. The table had triggers for delete, update, and insert, but I've tested with and without the triggers, and each time batch-delete was slower. Can anyone shed some light on why this is the case, or share tips on how I can debug it? From what I understand, I can't really reduce the number of times the trigger fires, but I had originally figured that lowering the number of DELETE queries would help with performance.

I've included some information below; please let me know if I've left out anything relevant.

Deletions are done in batches of 10,000 and the code looks something like this:

private void batchDeletion( Collection<Long> ids ) {
  // HQL bulk delete; Hibernate expands :ids into one bind parameter per element
  Query sql = getSession().createQuery( "DELETE FROM ObjImpl WHERE id IN (:ids)" );
  sql.setParameterList( "ids", ids );

  sql.executeUpdate();
}
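
For context, the 10,000-ID batches are produced by a driver loop roughly like the one below (a sketch only - the method and variable names are illustrative, not my actual code):

private static final int BATCH_SIZE = 10000;

private void deleteAll( List<Long> allIds ) {
  // Split the full ID list into chunks so each DELETE ... WHERE id IN (...)
  // carries a bounded number of bind parameters.
  for ( int i = 0; i < allIds.size(); i += BATCH_SIZE ) {
    int end = Math.min( i + BATCH_SIZE, allIds.size() );
    batchDeletion( allIds.subList( i, end ) );
  }
}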

The code to delete just a single row is basically:

sessionFactory.getCurrentSession().delete(obj);
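
Spelled out, that loop looks roughly like the following; the periodic flush/clear is the usual Hibernate batching idiom to keep the persistence context from growing across hundreds of thousands of deletes (a sketch - the cadence and names are illustrative, not necessarily what my code does):

Session session = sessionFactory.getCurrentSession();
int count = 0;
for ( ObjImpl obj : objects ) {
  session.delete( obj );    // queues "delete from table where id=?"
  if ( ++count % 50 == 0 ) {
    session.flush();        // push the pending deletes to the database
    session.clear();        // detach entities so the session stays small
  }
}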

The table has two indexes, neither of which is used by any of the deletions. No cascade operations occur.

Here is a sample EXPLAIN ANALYZE of DELETE FROM table WHERE id IN (1, 2, 3):

Delete on table  (cost=12.82..24.68 rows=3 width=6) (actual time=0.143..0.143 rows=0 loops=1)
  ->  Bitmap Heap Scan on table  (cost=12.82..24.68 rows=3 width=6) (actual time=0.138..0.138 rows=0 loops=1)
        Recheck Cond: (id = ANY ('{1,2,3}'::bigint[]))
        ->  Bitmap Index Scan on pk_table  (cost=0.00..12.82 rows=3 width=0) (actual time=0.114..0.114 rows=0 loops=1)
              Index Cond: (id = ANY ('{1,2,3}'::bigint[]))
Total runtime: 3.926 ms

I've vacuumed and reindexed each time I reload my data for testing and my test data contains 386,660 rows.

The test is to delete all the rows. I'm not using TRUNCATE because normally there's a selection criterion; for testing purposes I've made the criterion include all rows. With triggers enabled, deleting each row one by one took 193,616 ms, whereas batch-delete took 285,558 ms. With the triggers disabled, I got 93,793 ms for single-row deletes and 181,537 ms for batch-delete. The trigger sums up values and updates another table - basically bookkeeping.

I've played around with lower batch sizes (100 and 1) and they all seem to perform worse.

EDIT: I turned on Hibernate logging; for single row-by-row deletes, it's basically doing delete from table where id=?. Here's the EXPLAIN ANALYZE:

Delete on table  (cost=0.00..8.31 rows=1 width=6) (actual time=0.042..0.042 rows=0 loops=1)
  ->  Index Scan using pk_table on table  (cost=0.00..8.31 rows=1 width=6) (actual time=0.037..0.037 rows=0 loops=1)
        Index Cond: (id = 3874904)
Total runtime: 0.130 ms

EDIT: I was curious whether Postgres would do something different if the list actually contained 10,000 IDs: nope.

Delete on table  (cost=6842.01..138509.15 rows=9872 width=6) (actual time=17.170..17.170 rows=0 loops=1)
  ->  Bitmap Heap Scan on table  (cost=6842.01..138509.15 rows=9872 width=6) (actual time=17.160..17.160 rows=0 loops=1)
        Recheck Cond: (id = ANY ('{NUMBERS 1 THROUGH 10,000}'::bigint[]))
        ->  Bitmap Index Scan on pk_table  (cost=0.00..6839.54 rows=9872 width=0) (actual time=17.139..17.139 rows=0 loops=1)
              Index Cond: (id = ANY ('{NUMBERS 1 THROUGH 10,000}'::bigint[]))
Total runtime: 17.391 ms

EDIT: Based on the EXPLAIN ANALYZE output above, I've retrieved some logging from the actual delete operations. Below is logging of the two variations of single row-by-row deletes.

Here are some single deletes:

2013-03-14 13:09:25,424:delete from table where id=?
2013-03-14 13:09:25,424:delete from table where id=?
2013-03-14 13:09:25,424:delete from table where id=?
2013-03-14 13:09:25,424:delete from table where id=?
2013-03-14 13:09:25,424:delete from table where id=?
2013-03-14 13:09:25,424:delete from table where id=?
2013-03-14 13:09:25,424:delete from table where id=?
2013-03-14 13:09:25,424:delete from table where id=?
2013-03-14 13:09:25,424:delete from table where id=?
2013-03-14 13:09:25,424:delete from table where id=?

Here is the other variation of single deletes (the list contains just one item):

2013-03-14 13:49:59,858:delete from table where id in (?)
2013-03-14 13:50:01,460:delete from table where id in (?)
2013-03-14 13:50:03,040:delete from table where id in (?)
2013-03-14 13:50:04,544:delete from table where id in (?)
2013-03-14 13:50:06,125:delete from table where id in (?)
2013-03-14 13:50:07,707:delete from table where id in (?)
2013-03-14 13:50:09,275:delete from table where id in (?)
2013-03-14 13:50:10,833:delete from table where id in (?)
2013-03-14 13:50:12,369:delete from table where id in (?)
2013-03-14 13:50:13,873:delete from table where id in (?)

Both use IDs that exist in the table and should be sequential.


EXPLAIN ANALYZE of DELETE FROM table WHERE id = 3774887;

Delete on table  (cost=0.00..8.31 rows=1 width=6) (actual time=0.097..0.097 rows=0 loops=1)
  ->  Index Scan using pk_table on table  (cost=0.00..8.31 rows=1 width=6) (actual time=0.055..0.058 rows=1 loops=1)
        Index Cond: (id = 3774887)
Total runtime: 0.162 ms

EXPLAIN ANALYZE of DELETE FROM table WHERE id IN (3774887);

Delete on table  (cost=0.00..8.31 rows=1 width=6) (actual time=0.279..0.279 rows=0 loops=1)
  ->  Index Scan using pk_table on table  (cost=0.00..8.31 rows=1 width=6) (actual time=0.210..0.213 rows=1 loops=1)
        Index Cond: (id = 3774887)
Total runtime: 0.452 ms

Is 0.162 ms vs 0.452 ms considered a significant difference?

EDIT: I set the batch size to 50,000 and Hibernate didn't like that idea:

java.lang.StackOverflowError
        at org.hibernate.hql.ast.util.NodeTraverser.visitDepthFirst(NodeTraverser.java:40)
        at org.hibernate.hql.ast.util.NodeTraverser.visitDepthFirst(NodeTraverser.java:41)
        at org.hibernate.hql.ast.util.NodeTraverser.visitDepthFirst(NodeTraverser.java:42)
....
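
The trace suggests the HQL parser recurses once per element of the expanded IN-list, so stack depth grows with the parameter count. A crude safeguard (my own sketch - the constant and method names are made up) is to fail fast on oversized lists instead of letting the parser overflow the stack:

private static final int MAX_IN_LIST = 10000;

private void batchDeletionGuarded( Collection<Long> ids ) {
  // Fail fast rather than triggering a StackOverflowError in the HQL parser.
  if ( ids.size() > MAX_IN_LIST ) {
    throw new IllegalArgumentException( "ID list too large for one IN clause: " + ids.size() );
  }
  batchDeletion( ids );
}
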
asked Mar 14 '13 by nevets1219


1 Answer

OK, the first thing to note is that the SQL has to be converted into a query plan in some way, and your EXPLAIN results show that the planner treats an equality fundamentally differently from an IN (vals) construct.

WHERE id = 1;

Is transformed to a simple equality filter.

WHERE id IN (1);

Is transformed into an array match of:

WHERE id = ANY(ARRAY[1]);

Apparently the planner is not smart enough to notice that these are mathematically identical when the array has exactly one member. So it plans for an array of any size, which is why you get the bitmap index scan plan.

What's interesting here is not just that the array plan is slower, but how it scales: with one member in the IN() clause it is 40 times slower, yet with 10,000 members it is only 170 times slower in total - which also means the 10,000-member version is about 50 times faster than 10,000 separate index scans on id.

So what is happening is that the planner selects a plan that performs well when a large number of IDs are checked, but poorly when there are only a few.
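
If the one-element case matters in practice, a workaround on the Hibernate side (a sketch under that assumption - the method name is illustrative) is to pick the query form based on list size, so a singleton gets the cheap equality plan while larger batches keep the IN form:

private void deleteByIds( Collection<Long> ids ) {
  if ( ids.size() == 1 ) {
    // Plain equality: the planner uses the simple index-scan plan.
    Query q = getSession().createQuery( "DELETE FROM ObjImpl WHERE id = :id" );
    q.setParameter( "id", ids.iterator().next() );
    q.executeUpdate();
  } else {
    // Array match: the planner's bitmap plan wins for large lists.
    Query q = getSession().createQuery( "DELETE FROM ObjImpl WHERE id IN (:ids)" );
    q.setParameterList( "ids", ids );
    q.executeUpdate();
  }
}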

answered by Chris Travers