
Better understand SQLAlchemy's `yield_per()` problems

To cite the SQLAlchemy documentation:

The Query.yield_per() method is not compatible with most eager loading schemes, including subqueryload and joinedload with collections.

Warning

Use this method with caution; if the same instance is present in more than one batch of rows, end-user changes to attributes will be overwritten.

In particular, it’s usually impossible to use this setting with eagerly loaded collections (i.e. any lazy=’joined’ or ‘subquery’) since those collections will be cleared for a new load when encountered in a subsequent result batch. In the case of ‘subquery’ loading, the full result for all rows is fetched which generally defeats the purpose of yield_per().

Also note that while yield_per() will set the stream_results execution option to True, currently this is only understood by psycopg2 dialect which will stream results using server side cursors instead of pre-buffer all rows for this query. Other DBAPIs pre-buffer all rows before making them available. The memory use of raw database rows is much less than that of an ORM-mapped object, but should still be taken into consideration when benchmarking.

I really have a problem understanding how yield_per() works and what exactly the problem is with using this method. Also, what is the right way to work around these problems and keep using this function for iterating over a huge number of rows?

I'm interested in all constructive information you have, but here are some guiding questions:

  • How can there be multiple instances of the same row? Only through relationships (i.e. if two rows of the table you are iterating have an FK to the same row in another table)? Is it a problem if you don't know that it happens, or if you only read attributes on the relationships?
  • lazy=’joined’ or ‘subquery’ are not possible, but why exactly? Both of them are simply parts of your query on which you call yield_per().
    • If they are cleared in a subsequent result batch, then simply load them again. So where is the problem? Or is the only problem that you lose the changes you have made to your relationships?
    • In the case of ‘subquery’ loading, why are all rows fetched? The SQL server may have to hold a big result set, but then why not simply return the result in batches, one after the other, for the entire query?
    • In an example in the yield_per() doc, q = sess.query(Object).yield_per(100).options(lazyload('*'), joinedload(Object.some_related)), they deactivate eager loading with lazyload('*') but keep a single joined load. Is there a way to still use yield_per() with eager loading, and if so, under what conditions? (A sketch of that pattern follows this list.)
  • They say psycopg2 is the only DBAPI which supports streaming results. So is that the only DBAPI you can use with yield_per()? As far as I understand, yield_per() uses the DBAPI's cursor.fetchmany() function (example), which many DBAPIs support. And as far as I understand, cursor.fetchmany() supports fetching only part of the result, not everything (if it fetched everything, why would the function exist?)
  • I have the feeling that yield_per() is entirely safe (even with eager loading) if you only do read access (for example, for statistics). Is that correct?
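For reference, here is a minimal sketch of the pattern from that doc example, assuming a 1.x-style Query API, an existing Session named sess, and hypothetical Order/Customer models with a many-to-one Order.customer relationship:

```python
from sqlalchemy.orm import joinedload, lazyload

# Hypothetical models: Order has a many-to-one Order.customer relationship.
# joinedload is only compatible with yield_per for such scalar (many-to-one)
# relationships, because each primary row then carries at most one related row.
q = (
    sess.query(Order)
    .options(lazyload("*"), joinedload(Order.customer))  # disable all other eager loads
    .yield_per(100)
)
for order in q:
    print(order.customer)  # already loaded from the same row, no extra query
```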
F.Raab asked Jul 13 '15

1 Answer

Both of the problematic loading strategies raise exceptions if you try to use them with yield_per, so you don't really have to worry too much.

I believe the only problem with subqueryload is that batched loading of the second query isn't implemented (yet). Nothing would go wrong semantically, but if you're using yield_per, you probably have a really good reason not to want to load all the results at once. So SQLAlchemy politely refuses to go against your wishes.
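As a hedged illustration (SQLAlchemy 1.x Query API; Parent and its one-to-many Parent.children relationship are hypothetical models), the refusal looks roughly like this:

```python
from sqlalchemy.exc import InvalidRequestError
from sqlalchemy.orm import subqueryload

q = session.query(Parent).options(subqueryload(Parent.children)).yield_per(10)
try:
    for parent in q:
        pass
except InvalidRequestError as exc:
    # Exact exception class and wording vary by version; the point is that
    # SQLAlchemy refuses rather than silently fetching the whole subquery
    # result. A joinedload on a collection fails the same way.
    print(exc)
```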

joinedload is a little more subtle. It's only forbidden in the case of a collection, where a primary row might have multiple associated rows. Say your query produces raw results like this, where A and B are primary keys from different tables:

 A | B 
---+---
 1 | 1 
 1 | 2 
 1 | 3 
 1 | 4 
 2 | 5 
 2 | 6 

Now you fetch these with yield_per(3). The problem is that SQLAlchemy can only limit how much it fetches by rows, but it has to return objects. Here, SQLAlchemy only sees the first three rows, so it creates an A object with key 1 and three B children: 1, 2, and 3.

When it loads the next batch, it wants to create a new A object with key 1... ah, but it already has one of those, so no need to create it again. The extra B, 4, is lost. (So, no, even reading joined collections with yield_per is unsafe — chunks of your data might go missing.)

You might say "well, just keep reading rows until you have a full object" — but what if that A has a hundred children? Or a million? SQLAlchemy can't reasonably guarantee that it can do what you asked and produce correct results, so it refuses to try.
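To make that concrete, here is a toy, pure-Python simulation of the behavior just described (it mimics the identity-map bookkeeping; it is not SQLAlchemy's actual code):

```python
# Raw joined rows from the table above, as (A key, B key) tuples.
rows = [(1, 1), (1, 2), (1, 3), (1, 4), (2, 5), (2, 6)]

def batches(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

children = {}  # stand-in for the identity map: one entry per A object
for batch in batches(rows, 3):
    for a, b in batch:
        if a not in children:
            # A new A object is built from the rows of *this batch only*.
            children[a] = [bb for (aa, bb) in batch if aa == a]
        # An A already seen in an earlier batch is reused as-is, so its
        # extra B rows in this batch are silently dropped.

print(children)  # {1: [1, 2, 3], 2: [5, 6]} -- the B with key 4 is lost
```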


Remember that the DBAPI is designed so that any database can be used with the same API, even if that database doesn't support all the DBAPI features. Consider that the DBAPI is designed around cursors, but MySQL doesn't actually have cursors! The DBAPI adapters for MySQL have to fake them, instead.

So while cursor.fetchmany(100) will work, you can see from the MySQLdb source code that it doesn't fetch lazily from the server; it fetches everything into one big list, then returns a slice when you call fetchmany.

What psycopg2 supports is true streaming, where the results are remembered persistently on the server, and your Python process only sees a few of them at a time.

You can still use yield_per with MySQLdb, or any other DBAPI; that's the whole point of the DBAPI's design. You'll have to pay the memory cost for all the raw rows hidden away in the DBAPI (which are tuples, fairly cheap), but you won't also have to pay for all of the ORM objects at the same time.
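Roughly, and under stated assumptions (placeholder connection URL, a mapped Object class, and a process() helper, none of which come from the question), usage looks the same either way; only the memory profile differs:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# With the psycopg2 dialect, yield_per() also enables stream_results, so
# rows arrive in chunks from a server-side cursor.
engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")
session = sessionmaker(bind=engine)()

for obj in session.query(Object).yield_per(1000):
    process(obj)  # at most ~1000 ORM objects are materialized at once

# With mysql+mysqldb:// the same loop still works; MySQLdb just buffers
# every raw row tuple client-side first, so you save on ORM objects but
# not on raw-row memory.
```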

Eevee answered Oct 02 '22