Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to handle large dataset with JPA (or at least with Hibernate)?

I need to make my web-app work with really huge datasets. At the moment I get either OutOfMemoryException or output which is being generated 1-2 minutes.

Let's put it simple and suppose that we have 2 tables in DB: Worker and WorkLog with about 1000 rows in the first one and 10 000 000 rows in the second one. Latter table has several fields including 'workerId' and 'hoursWorked' fields among others. What we need is:

  1. count total hours worked by each user;

  2. list of work periods for each user.

The most straightforward approach (IMO) for each task in plain SQL is:

1)

select Worker.name, sum(hoursWorked) from Worker, WorkLog 
   where Worker.id = WorkLog.workerId 
   group by Worker.name;

//results of this query should be transformed to Multimap<Worker, Long>

2)

select Worker.name, WorkLog.start, WorkLog.hoursWorked from Worker, WorkLog
   where Worker.id = WorkLog.workerId;

//results of this query should be transformed to Multimap<Worker, Period>
//if it was JDBC then it would be vitally 
//to set resultSet.setFetchSize (someSmallNumber), ~100

So, I have two questions:

  1. how to implement each of my approaches with JPA (or at least with Hibernate);
  2. how would you handle this problem (with JPA or Hibernate of course)?
like image 453
Roman Avatar asked May 03 '10 22:05

Roman


2 Answers

suppose that we have 2 tables in DB: Worker and WorkLog with about 1000 rows in the first one and 10 000 000 rows in the second one

For high volumes like this, my recommendation would be to use The StatelessSession interface from Hibernate:

Alternatively, Hibernate provides a command-oriented API that can be used for streaming data to and from the database in the form of detached objects. A StatelessSession has no persistence context associated with it and does not provide many of the higher-level life cycle semantics. In particular, a stateless session does not implement a first-level cache nor interact with any second-level or query cache. It does not implement transactional write-behind or automatic dirty checking. Operations performed using a stateless session never cascade to associated instances. Collections are ignored by a stateless session. Operations performed via a stateless session bypass Hibernate's event model and interceptors. Due to the lack of a first-level cache, Stateless sessions are vulnerable to data aliasing effects. A stateless session is a lower-level abstraction that is much closer to the underlying JDBC.

StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();

ScrollableResults customers = session.getNamedQuery("GetCustomers")
    .scroll(ScrollMode.FORWARD_ONLY);
while ( customers.next() ) {
    Customer customer = (Customer) customers.get(0);
    customer.updateStuff(...);
    session.update(customer);
}

tx.commit();
session.close();

In this code example, the Customer instances returned by the query are immediately detached. They are never associated with any persistence context.

The insert(), update() and delete() operations defined by the StatelessSession interface are considered to be direct database row-level operations. They result in the immediate execution of a SQL INSERT, UPDATE or DELETE respectively. They have different semantics to the save(), saveOrUpdate() and delete() operations defined by the Session interface.

like image 173
Pascal Thivent Avatar answered Oct 17 '22 08:10

Pascal Thivent


It seems you can do this with EclipseLink too. Check this : http://wiki.eclipse.org/EclipseLink/Examples/JPA/Pagination :

Query query = em.createQuery...
query.setHint(QueryHints.CURSOR, true)
     .setHint(QueryHints.SCROLLABLE_CURSOR, true)
ScrollableCursor scrl = (ScrollableCursor)q.getSingleResult();
Object o = null;
while ((o = scrl.next()) != null) { ... }
like image 31
lapsus63 Avatar answered Oct 17 '22 07:10

lapsus63