Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using Hibernate's ScrollableResults to slowly read 90 million records

I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate:

ScrollableResults results = session.createQuery("SELECT person FROM Person person")             .setReadOnly(true).setCacheable(false).scroll(ScrollMode.FORWARD_ONLY); while (results.next())     storeInFile(results.get()[0]); 

The problem is the above will try and load all 90 million rows into RAM before moving on to the while loop... and that will kill my memory with OutOfMemoryError: Java heap space exceptions :(.

So I guess ScrollableResults isn't what I was looking for? What is the proper way to handle this? I don't mind if this while loop takes days (well I'd love it to not).

I guess the only other way to handle this is to use setFirstResult and setMaxResults to iterate through the results and just use regular Hibernate results instead of ScrollableResults. That feels like it will be inefficient though and will start taking a ridiculously long time when I'm calling setFirstResult on the 89 millionth row...

UPDATE: setFirstResult/setMaxResults doesn't work, it turns out to take an unusably long time to get to the offsets like I feared. There must be a solution here! Isn't this a pretty standard procedure?? I'm willing to forgo Hibernate and use JDBC or whatever it takes.

UPDATE 2: the solution I've come up with which works ok, not great, is basically of the form:

select * from person where id > <offset> and <other_conditions> limit 1 

Since I have other conditions, even all in an index, it's still not as fast as I'd like it to be... so still open for other suggestions..

like image 953
at. Avatar asked May 13 '10 11:05

at.


2 Answers

Using setFirstResult and setMaxResults is your only option that I'm aware of.

Traditionally a scrollable resultset would only transfer rows to the client on an as required basis. Unfortunately the MySQL Connector/J actually fakes it, it executes the entire query and transports it to the client, so the driver actually has the entire result set loaded in RAM and will drip feed it to you (evidenced by your out of memory problems). You had the right idea, it's just shortcomings in the MySQL java driver.

I found no way to get around this, so went with loading large chunks using the regular setFirst/max methods. Sorry to be the bringer of bad news.

Just make sure to use a stateless session so there's no session level cache or dirty tracking etc.

EDIT:

Your UPDATE 2 is the best you're going to get unless you break out of the MySQL J/Connector. Though there's no reason you can't up the limit on the query. Provided you have enough RAM to hold the index this should be a somewhat cheap operation. I'd modify it slightly, and grab a batch at a time, and use the highest id of that batch to grab the next batch.

Note: this will only work if other_conditions use equality (no range conditions allowed) and have the last column of the index as id.

select *  from person  where id > <max_id_of_last_batch> and <other_conditions>  order by id asc   limit <batch_size> 
like image 161
Michael Avatar answered Sep 23 '22 23:09

Michael


You should be able to use a ScrollableResults, though it requires a few magic incantations to get working with MySQL. I wrote up my findings in a blog post (http://www.numerati.com/2012/06/26/reading-large-result-sets-with-hibernate-and-mysql/) but I'll summarize here:

"The [JDBC] documentation says:

To enable this functionality, create a Statement instance in the following manner: stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,                 java.sql.ResultSet.CONCUR_READ_ONLY); stmt.setFetchSize(Integer.MIN_VALUE); 

This can be done using the Query interface (this should work for Criteria as well) in version 3.2+ of the Hibernate API:

Query query = session.createQuery(query); query.setReadOnly(true); // MIN_VALUE gives hint to JDBC driver to stream results query.setFetchSize(Integer.MIN_VALUE); ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY); // iterate over results while (results.next()) {     Object row = results.get();     // process row then release reference     // you may need to evict() as well } results.close(); 

This allows you to stream over the result set, however Hibernate will still cache results in the Session, so you’ll need to call session.evict() or session.clear() every so often. If you are only reading data, you might consider using a StatelessSession, though you should read its documentation beforehand."

like image 23
Sean S. Avatar answered Sep 25 '22 23:09

Sean S.