Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

doctrine2 - How to improve flush efficiency?

I have to update my Doctrine entities to match records inside of (potentionaly very large) XML file. I have to also update ManyToMany associations according to data in the XML. This is what I do inside of a loop:

  1. get data from XML
  2. get entity from DB (if does not exist create new)
  3. set new entity properties
  4. get current entity associations (getter returns ArrayCollection object)
  5. clear all associations (by calling the ArrayCollection::clear())
  6. set new associations (by calling ArrayCollection::add() in sub-loop)
  7. persist entity by EntityManager

After the loop I call EntityManager::flush().

The problem is that flushing generates large amount of queries instead of updating/inserting/deleting multiple rows at once. For every entity are executed following queries:

  • SELECT to get entity from DB
  • UPDATE to update entity properties (this is actually skipped now as no properties changed ... yet)
  • DELETE to clear previous associations
  • INSERT to insert new associations

So in total for 305 records in XML i get 915 queries (I guess it could go up to 1220 queries if all entities would changed) which makes the import very slow.

I could take advantage of IdentityMap and pre-fetch entities before loop, but there are still the UPDATE/DELETE/INSERT queries.

  • Is there a way to let the flush method better optimize queries (use multi-insert, WHERE IN instead of multiple DELETE queries, etc.)?
  • Is this normal behaviour of flush method or am I doing something wrong?
  • Perhaps there is problem in the way how I update the associations of entity. Is there better way how to do this? (instead of "get/clear/add" method)
  • I am aware of that Doctrine is not intended for mass betch processing, but I think using it for XML imports is the best way how to avoid DB inconsitencies which could appear with a not-ORM approach. Is that right?
  • If the approach above is wrong, how should I solve the problem?
like image 634
Petr Peller Avatar asked Jan 03 '12 03:01

Petr Peller


1 Answers

You're doing it right -- it's just slow, because the added abstraction of the ORM means you can't make the sorts of optimizations you'd like.

That said, the EntityManager does get slow on transactions that large. If you don't absolutely need them all in one big transaction, you can probably get more performant code by flush()ing and then clear()ing the EM every 20-200 iterations of your loop.

If that doesn't get you enough performance, the only alternative that I can think of is to revert to custom code that runs custom SQL directly against your DBMS.

I know this isn't a great answer, but at least I can tell you that you're not crazy.

------ edit ------

From official Doctrine2 article on Batch processing:

Some people seem to be wondering why Doctrine does not use multi-inserts (insert into (...) values (...), (...), (...), ...

First of all, this syntax is only supported on mysql and newer postgresql versions. Secondly, there is no easy way to get hold of all the generated identifiers in such a multi-insert when using AUTO_INCREMENT or SERIAL and an ORM needs the identifiers for identity management of the objects. Lastly, insert performance is rarely the bottleneck of an ORM. Normal inserts are more than fast enough for most situations and if you really want to do fast bulk inserts, then a multi-insert is not the best way anyway, i.e. Postgres COPY or Mysql LOAD DATA INFILE are several orders of magnitude faster.

These are the reasons why it is not worth the effort to implement an abstraction that performs multi-inserts on mysql and postgresql in an ORM.

Also there is a significant difference in performance when using remote vs local database as overhead of sending each query to remote server is quite large. The overhead is much lower while using local database thanks to transactions and DB optimizations. (e.g. 70sec lowered to 300ms in the case of example in the question)

like image 102
timdev Avatar answered Oct 20 '22 11:10

timdev