
Doctrine Batch Processing Iterate High Memory Usage

Batch processing with Doctrine is trickier than it seems, even with the help of iterate() and IterableResult.

Just as you expected, the greatest benefit of IterableResult is that it does not load all of the elements into memory. The second benefit is that it doesn't hold references to the entities you load, so IterableResult doesn't prevent GC from freeing the memory your entities occupy.

However, there's another object, Doctrine's EntityManager (more specifically its UnitOfWork), which holds a reference to each object you have queried explicitly or implicitly (through EAGER associations).

In simple words, whenever you get one or more entities back from findAll(), findOneBy(), DQL queries, or even IterableResult, a reference to each of those entities is saved inside Doctrine. The reference is simply stored in an associative array; here's pseudocode: $identityMap['Acme\Entities\Image'][0] = $image0;
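As a quick illustration (a sketch, assuming an open EntityManager $em and an Image row with ID 1 in the database), the identity map is exactly why two lookups for the same ID give you the very same object instance:

$a = $em->find('Acme\Entities\Image', 1);
$b = $em->find('Acme\Entities\Image', 1);

var_dump($a === $b); // bool(true): same instance, kept alive by the identity map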

So, upon each iteration of your loop, the previous images (despite no longer being referenced in the loop's scope or IterableResult's scope) are still present inside this identityMap. GC cannot clean them, and your memory consumption ends up the same as when you were calling findAll().

Now let's go through the code and see what is actually happening:

$query = $this->em->createQuery('SELECT i FROM Acme\Entities\Image i');
// here Doctrine only creates the Query object; no DB access happens yet

$iterable = $query->iterate();
// unlike findAll(), this call causes no DB access either;
// the Query object is simply wrapped in an Iterator

while (($image_row = $iterable->next()) !== false) {
    // upon the first call to next(), the DB WILL BE ACCESSED FOR THE FIRST TIME
    // and the first resulting row will be returned;
    // the row will be hydrated into an Image object
    // ----> A REFERENCE TO THE OBJECT WILL BE SAVED INSIDE $identityMap <----
    // and the row is handed back to you by next()

    // to access the actual Image object, you need to take the [0]th element of the array
    $image = $image_row[0];

    // do something here!
    write_image_data_to_file($image, 'myimage.data.bin');

    // now, as the loop iteration ends, $image (and $image_row) go out of scope
    // and by all appearances should be ready for GC;
    // however, because a reference to this specific Image object is still held
    // by the EntityManager (inside $identityMap), GC will NOT clean it
}
// and by the end of your loop you will have consumed as much memory
// as you would have by using findAll()

So the first solution is to explicitly tell the Doctrine EntityManager to detach the object from the $identityMap. I also replaced the while loop with foreach to make it more readable.

foreach ($iterable as $image_row) {
    $image = $image_row[0];

    // do something with the image
    write_image_data_to_file($image);

    $entity_manager->detach($image);
    // this line tells Doctrine to remove the _reference_to_the_object_
    // from the identity map, so the object becomes ready for GC
}
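To convince yourself this works, here's a small sketch (the only addition is memory_get_usage(); write_image_data_to_file() is the same hypothetical helper as above) that prints memory usage every thousand images. With detach() in place the number should stay flat:

foreach ($iterable as $i => $image_row) {
    $image = $image_row[0];
    write_image_data_to_file($image);
    $entity_manager->detach($image);

    if ($i % 1000 === 0) {
        // should stay roughly constant instead of growing with every batch
        echo memory_get_usage(true) / 1024 / 1024 . " MB\n";
    }
}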

However, the example above has a few flaws, even though it is featured in Doctrine's documentation on batch processing. It works well as long as your Image entity doesn't EAGERly load any of its associations. But if you're EAGERly loading any association, e.g.:

/**
 * @ORM\Entity
 */
class Image {

    /**
     * @ORM\Column(type="integer")
     * @ORM\Id
     */
    private $id;

    /**
     * @ORM\Column(type="string")
     */
    private $imageName;

    /**
     * @ORM\ManyToOne(targetEntity="Acme\Entities\User", fetch="EAGER")
     *
     * This association will be automatically (EAGERly) loaded by Doctrine
     * every time you query the Image entity from the DB,
     * whether by findXXX(), DQL, or iterate().
     */
    private $owner;

    // getters/setters left out for clarity
}

So if we use the same piece of code as above:

foreach ($iterable as $image_row) {
    $image = $image_row[0];
    // here, because of EAGER loading, we already have the owner entity in memory;
    // it can be accessed via $image->getOwner()

    // do something with the image
    write_image_data_to_file($image);

    $entity_manager->detach($image);
    // here we detach the Image entity, but the $owner User entity is still
    // referenced in Doctrine's $identityMap. Thus we are still leaking memory.
}
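One way to plug this particular leak (a sketch; it assumes the owner is reachable through the getter shown above and has no EAGER associations of its own) is to detach the eagerly loaded owner explicitly as well:

foreach ($iterable as $image_row) {
    $image = $image_row[0];

    write_image_data_to_file($image);

    // detach the eagerly loaded association first, then the image itself
    $entity_manager->detach($image->getOwner());
    $entity_manager->detach($image);
}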

A possible solution is to use EntityManager::clear() instead of EntityManager::detach(), which clears the identity map COMPLETELY.

foreach ($iterable as $image_row) {
    $image = $image_row[0];
    // here, because of EAGER loading, we already have the owner entity in memory;
    // it can be accessed via $image->getOwner()

    // do something with the image
    write_image_data_to_file($image);

    $entity_manager->clear();
    // now $identityMap will be cleared of ALL entities it holds:
    // the Image and the User loaded in this loop iteration, and, as a
    // SIDE EFFECT, all OTHER entities you may have loaded
    // earlier. Thus, when you enter this loop you must NOT rely
    // on any entities you have persist()ed or remove()d;
    // all changes since the last flush() will be lost.
}
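If you do have pending changes when you enter such a loop, the safe pattern (following directly from the caveat in the comment above) is to flush them first:

// flush pending persist()/remove() changes before a loop that calls clear()
$entity_manager->flush();

foreach ($iterable as $image_row) {
    $image = $image_row[0];
    write_image_data_to_file($image);
    $entity_manager->clear(); // safe now: nothing unflushed can be lost
}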

So I hope this helps you understand Doctrine iteration a little better.


If you combine Doctrine's iterate() with a batching strategy, you should be able to iterate over large record sets.

For example:


$batchSize = 1000;
$numberOfRecordsPerPage = 5000;

// get the total number of records to iterate over
$totalRecords = $queryBuilder->select('count(i.id)')
    ->from('Acme\Entities\Image', 'i')
    ->getQuery()
    ->getSingleScalarResult();

$totalProcessed = 0;

while (true) {
    $query = $entityManager->createQuery('SELECT i FROM Acme\Entities\Image i')
        ->setMaxResults($numberOfRecordsPerPage) // maximum records to fetch at a time
        ->setFirstResult($totalProcessed);

    $iterableResult = $query->iterate();

    while (($row = $iterableResult->next()) !== false) {
        $image = $row[0];
        $image->updateSomethingImportant();

        $totalProcessed++;
        if (($totalProcessed % $batchSize) === 0) {
            $entityManager->flush();
            $entityManager->clear();
        }
    }

    if ($totalProcessed >= $totalRecords) {
        break;
    }
}

$entityManager->flush();


See Iterating large data in doctrine 2
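A variation worth knowing (a sketch; it assumes Image has a numeric id exposed via getId()) is keyset pagination: instead of an ever-growing OFFSET, filter on the last seen ID, so MySQL doesn't have to scan and throw away all the skipped rows on each page:

$lastId = 0;

do {
    $images = $entityManager->createQuery(
        'SELECT i FROM Acme\Entities\Image i WHERE i.id > :lastId ORDER BY i.id ASC'
    )
        ->setParameter('lastId', $lastId)
        ->setMaxResults(1000)
        ->getResult();

    foreach ($images as $image) {
        $image->updateSomethingImportant();
        $lastId = $image->getId();
    }

    $entityManager->flush();
    $entityManager->clear(); // remember: clear() detaches everything
} while (!empty($images));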


I strongly believe that batch processing with Doctrine, or any kind of iteration with MySQL (PDO or mysqli), is just an illusion.

@dimitri-k provided a nice explanation, especially about the unit of work. The problem is the misleading $query->iterate(), which doesn't really iterate over the data source. It's just a \Traversable wrapper around an already fully fetched data source.

An example demonstrating that, even with Doctrine's abstraction layer completely removed from the picture, we still run into memory issues:

echo 'Starting with memory usage: ' . memory_get_usage(true) / 1024 / 1024 . " MB \n";

$pdo  = new \PDO("mysql:dbname=DBNAME;host=HOST", "USER", "PW");
$stmt = $pdo->prepare('SELECT * FROM my_big_table LIMIT 100000');
$stmt->execute();

while ($rawCampaign = $stmt->fetch()) {
    // echo $rawCampaign['id'] . "\n";
}

echo 'Ending with memory usage: ' . memory_get_usage(true) / 1024 / 1024 . " MB \n";

Output:

Starting with memory usage: 6 MB 
Ending with memory usage: 109.46875 MB
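For completeness: this growth comes from the MySQL driver's default buffered mode, which copies the whole result set to the client on execute(). PDO can be switched to unbuffered mode via a connection attribute (with its own trade-off: you can't run another query on the same connection until the result is fully consumed):

// with buffering off, fetch() streams rows from the server one at a time
$pdo->setAttribute(\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);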

Here is the disappointing getIterator() method:

// from Doctrine\DBAL\Driver\Mysqli\MysqliStatement

/**
 * {@inheritdoc}
 */
public function getIterator()
{
    $data = $this->fetchAll();

    return new \ArrayIterator($data);
}

You can use my little library to actually stream heavy tables using PHP Doctrine, DQL, or just pure SQL, however you find appropriate: https://github.com/EnchanterIO/remote-collection-stream


tl;dr

When running the command, use --no-debug or set the SQL logger to null to prevent it from saving every query it runs.
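Assuming the Doctrine DBAL 2.x configuration API, that looks like:

// stop DBAL from accumulating every executed query in memory
$entityManager->getConnection()->getConfiguration()->setSQLLogger(null);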

Every now and then, call EntityManager::clear() and the memory leak will go down to almost zero.