Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

JPA: what is the proper pattern for iterating over large result sets?

People also ask

What is the use of @query in JPA?

Understanding the @Query Annotation The @Query annotation can only be used to annotate repository interface methods. The call of the annotated methods will trigger the execution of the statement found in it, and their usage is pretty straightforward. The @Query annotation supports both native SQL and JPQL.

Can JPA return results as a map?

There is no standard way to get JPA to return a map.


Page 537 of Java Persistence with Hibernate gives a solution using ScrollableResults, but alas it's only for Hibernate.

So it seems that using setFirstResult/setMaxResults and manual iteration really is necessary. Here's my solution using JPA:

private List<Model> getAllModelsIterable(int offset, int max)
{
    return entityManager.createQuery("from Model m", Model.class).setFirstResult(offset).setMaxResults(max).getResultList();
}

then, use it like this:

private void iterateAll()
{
    int offset = 0;

    List<Model> models;
    while ((models = Model.getAllModelsIterable(offset, 100)).size() > 0)
    {
        entityManager.getTransaction().begin();
        for (Model model : models)
        {
            log.info("do something with model: " + model.getId());
        }

        entityManager.flush();
        entityManager.clear();
        em.getTransaction().commit();
        offset += models.size();
    }
}

I tried the answers presented here, but JBoss 5.1 + MySQL Connector/J 5.1.15 + Hibernate 3.3.2 didn't work with those. We've just migrated from JBoss 4.x to JBoss 5.1, so we've stuck with it for now, and thus the latest Hibernate we can use is 3.3.2.

Adding couple of extra parameters did the job, and code like this runs without OOMEs:

        StatelessSession session = ((Session) entityManager.getDelegate()).getSessionFactory().openStatelessSession();

        Query query = session
                .createQuery("SELECT a FROM Address a WHERE .... ORDER BY a.id");
        query.setFetchSize(Integer.valueOf(1000));
        query.setReadOnly(true);
        query.setLockMode("a", LockMode.NONE);
        ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
        while (results.next()) {
            Address addr = (Address) results.get(0);
            // Do stuff
        }
        results.close();
        session.close();

The crucial lines are the query parameters between createQuery and scroll. Without them the "scroll" call tries to load everything into memory and either never finishes or runs to OutOfMemoryError.


You can't really do this in straight JPA, however Hibernate has support for stateless sessions and scrollable result sets.

We routinely process billions of rows with its help.

Here is a link to documentation: http://docs.jboss.org/hibernate/core/3.3/reference/en/html/batch.html#batch-statelesssession


To be honest, I would suggest leaving JPA and stick with JDBC (but certainly using JdbcTemplate support class or such like). JPA (and other ORM providers/specifications) is not designed to operate on many objects within one transaction as they assume everything loaded should stay in first-level cache (hence the need for clear() in JPA).

Also I am recommending more low level solution because the overhead of ORM (reflection is only a tip of an iceberg) might be so significant, that iterating over plain ResultSet, even using some lightweight support like mentioned JdbcTemplate will be much faster.

JPA is simply not designed to perform operations on a large amount of entities. You might play with flush()/clear() to avoid OutOfMemoryError, but consider this once again. You gain very little paying the price of huge resource consumption.


If you use EclipseLink I' using this method to get result as Iterable

private static <T> Iterable<T> getResult(TypedQuery<T> query)
{
  //eclipseLink
  if(query instanceof JpaQuery) {
    JpaQuery<T> jQuery = (JpaQuery<T>) query;
    jQuery.setHint(QueryHints.RESULT_SET_TYPE, ResultSetType.ForwardOnly)
       .setHint(QueryHints.SCROLLABLE_CURSOR, true);

    final Cursor cursor = jQuery.getResultCursor();
    return new Iterable<T>()
    {     
      @SuppressWarnings("unchecked")
      @Override
      public Iterator<T> iterator()
      {
        return cursor;
      }
    }; 
   }
  return query.getResultList();  
}  

close Method

static void closeCursor(Iterable<?> list)
{
  if (list.iterator() instanceof Cursor)
    {
      ((Cursor) list.iterator()).close();
    }
}

It depends upon the kind of operation you have to do. Why are you looping over a million of row? Are you updating something in batch mode? Are you going to display all records to a client? Are you computing some statistics upon the retrieved entities?

If you are going to display a million records to the client, please reconsider your user interface. In this case, the appropriate solution is paginating your results and using setFirstResult() and setMaxResult().

If you have launched an update of a large amount of records, you'll better keep the update simple and use Query.executeUpdate(). Optionally, you can execute the update in asynchronous mode using a Message-Driven Bean o a Work Manager.

If you are computing some statistics upon the retrieved entities, you can take advantage on the grouping functions defined by the JPA specification.

For any other case, please be more specific :)


There is no "proper" what to do this, this isn't what JPA or JDO or any other ORM is intended to do, straight JDBC will be your best alternative, as you can configure it to bring back a small number of rows at a time and flush them as they are used, that is why server side cursors exist.

ORM tools are not designed for bulk processing, they are designed to let you manipulate objects and attempt to make the RDBMS that the data is stored in be as transparent as possible, most fail at the transparent part at least to some degree. At this scale, there is no way to process hundreds of thousands of rows ( Objects ), much less millions with any ORM and have it execute in any reasonable amount of time because of the object instantiation overhead, plain and simple.

Use the appropriate tool. Straight JDBC and Stored Procedures definitely have a place in 2011, especially at what they are better at doing versus these ORM frameworks.

Pulling a million of anything, even into a simple List<Integer> is not going to be very efficient regardless of how you do it. The correct way to do what you are asking is a simple SELECT id FROM table, set to SERVER SIDE ( vendor dependent ) and the cursor to FORWARD_ONLY READ-ONLY and iterate over that.

If you are really pulling millions of id's to process by calling some web server with each one, you are going to have to do some concurrent processing as well for this to run in any reasonable amount of time. Pulling with a JDBC cursor and placing a few of them at a time in a ConcurrentLinkedQueue and having a small pool of threads ( # CPU/Cores + 1 ) pull and process them is the only way to complete your task on a machine with any "normal" amount of RAM, given you are already running out of memory.

See this answer as well.