To cite the SQLAlchemy documentation:
The Query.yield_per() method is not compatible with most eager loading schemes, including subqueryload and joinedload with collections.
Warning
Use this method with caution; if the same instance is present in more than one batch of rows, end-user changes to attributes will be overwritten.
In particular, it’s usually impossible to use this setting with eagerly loaded collections (i.e. any lazy=’joined’ or ‘subquery’) since those collections will be cleared for a new load when encountered in a subsequent result batch. In the case of ‘subquery’ loading, the full result for all rows is fetched which generally defeats the purpose of yield_per().
Also note that while yield_per() will set the stream_results execution option to True, currently this is only understood by psycopg2 dialect which will stream results using server side cursors instead of pre-buffer all rows for this query. Other DBAPIs pre-buffer all rows before making them available. The memory use of raw database rows is much less than that of an ORM-mapped object, but should still be taken into consideration when benchmarking.
I really have a problem understanding how yield_per() works and what exactly the problem is with using this method. Also, what is the right way to work around these problems and keep using this function for iterating over a huge number of rows?

I'm interested in any constructive information you have, but here are some hint questions:
1. What exactly is the problem with using yield_per()? In the doc example q = sess.query(Object).yield_per(100).options(lazyload('*'), joinedload(Object.some_related)) they deactivate eager loading with lazyload('*') but keep a single joined load. Is there a way to still use yield_per() with eager loading? What are the conditions?

2. psycopg2 is the only DBAPI which supports streamed results. So is that the only DBAPI you can use with yield_per()? As far as I understand, yield_per() uses the DBAPI's cursor.fetchmany() function, which many of them support. And as far as I understand, cursor.fetchmany() fetches only part of the result rather than everything (if it fetched everything, why would the function exist?).

3. yield_per() is entirely safe (even with eager loading) if you only do read access (for example for statistics). Is that correct?

Both of the problematic loading strategies raise exceptions if you try to use them with yield_per, so you don't really have to worry too much.
I believe the only problem with subqueryload is that batched loading of the second query isn't implemented (yet). Nothing would go wrong semantically, but if you're using yield_per, you probably have a really good reason not to want to load all the results at once. So SQLAlchemy politely refuses to go against your wishes.
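For concreteness, here is a minimal sketch of the rejected combination. The Parent/Child models are hypothetical, and the exact exception type and message vary across SQLAlchemy versions:

    # Hypothetical one-to-many models, just to demonstrate the rejection.
    from sqlalchemy import Column, ForeignKey, Integer, create_engine
    from sqlalchemy.orm import Session, declarative_base, relationship, subqueryload

    Base = declarative_base()

    class Parent(Base):
        __tablename__ = "parent"
        id = Column(Integer, primary_key=True)
        children = relationship("Child", backref="parent")

    class Child(Base):
        __tablename__ = "child"
        id = Column(Integer, primary_key=True)
        parent_id = Column(Integer, ForeignKey("parent.id"))

    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        query = (
            session.query(Parent)
            .options(subqueryload(Parent.children))
            .yield_per(100)
        )
        try:
            for parent in query:       # executing the query is what fails
                pass
        except Exception as exc:       # exact exception class varies by version
            print(type(exc).__name__, exc)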
joinedload is a little more subtle. It's only forbidden in the case of a collection, where a primary row might have multiple associated rows. Say your query produces raw results like this, where A and B are primary keys from different tables:
A | B
---+---
1 | 1
1 | 2
1 | 3
1 | 4
2 | 5
2 | 6
Now you fetch these with yield_per(3). The problem is that SQLAlchemy can only limit how much it fetches by rows, but it has to return objects. Here, SQLAlchemy only sees the first three rows, so it creates an A object with key 1 and three B children: 1, 2, and 3.

When it loads the next batch, it wants to create a new A object with key 1... ah, but it already has one of those, so no need to create it again. The extra B, 4, is lost. (So, no, even reading joined collections with yield_per is unsafe: chunks of your data might go missing.)
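A toy simulation of that failure mode, in plain Python, may make it concrete. The batching rule below is a deliberate simplification of what the identity map does:

    # The raw rows from the table above, as (A, B) primary-key pairs.
    rows = [(1, 1), (1, 2), (1, 3), (1, 4), (2, 5), (2, 6)]

    identity_map = {}  # key of A -> list of B children built so far
    for start in range(0, len(rows), 3):       # yield_per(3): 3 rows per batch
        batch = rows[start:start + 3]
        seen_this_batch = set()
        for a, b in batch:
            if a not in identity_map:
                identity_map[a] = [b]          # brand-new A: start its collection
            elif a in seen_this_batch:
                identity_map[a].append(b)      # same batch: keep accumulating
            # else: this A was already "finished" in an earlier batch,
            # so its extra child is silently dropped
            seen_this_batch.add(a)

    print(identity_map)  # {1: [1, 2, 3], 2: [5, 6]} -- B 4 went missing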
You might say "well, just keep reading rows until you have a full object", but what if that A has a hundred children? Or a million? SQLAlchemy can't reasonably guarantee that it can do what you asked and produce correct results, so it refuses to try.
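What is allowed, then, is the pattern the question quotes from the docs: turn off eager loading wholesale and keep only many-to-one joined loads, which add columns to each row rather than extra rows. A sketch, reusing the hypothetical Parent/Child models and engine from the earlier snippet:

    from sqlalchemy.orm import Session, joinedload, lazyload

    with Session(engine) as session:
        query = (
            session.query(Child)
            .options(
                lazyload("*"),             # disable any configured eager loads
                joinedload(Child.parent),  # many-to-one: one row per Child, so safe
            )
            .yield_per(100)                # stream in batches of 100 rows
        )
        for child in query:
            ...                            # read-only work on each object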
Remember that the DBAPI is designed so that any database can be used with the same API, even if that database doesn't support all the DBAPI features. Consider that the DBAPI is designed around cursors, but MySQL doesn't actually have cursors! The DBAPI adapters for MySQL have to fake them, instead.
So while cursor.fetchmany(100) will work, you can see from the MySQLdb source code that it doesn't fetch lazily from the server; it fetches everything into one big list, then returns a slice when you call fetchmany.
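The uniformity is visible at the raw DBAPI level: the same fetchmany() loop is valid against any PEP 249 driver, whether or not the rows were already buffered client-side. The connection arguments and table name here are made up:

    import MySQLdb  # any DBAPI module works the same way

    conn = MySQLdb.connect(db="example")       # hypothetical connection args
    cur = conn.cursor()
    cur.execute("SELECT a, b FROM big_table")  # hypothetical table

    while True:
        rows = cur.fetchmany(100)  # at most 100 rows per call, per PEP 249
        if not rows:               # empty sequence means the result is exhausted
            break
        for row in rows:
            ...                    # process one raw tuple at a time

    cur.close()
    conn.close()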
What psycopg2 supports is true streaming, where the results are remembered persistently on the server, and your Python process only sees a few of them at a time.
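With psycopg2, that streaming is exposed through server-side ("named") cursors, which is what the stream_results option mentioned in the docs turns on under the hood. A direct-DBAPI sketch (connection string and table are assumptions):

    import psycopg2

    conn = psycopg2.connect("dbname=example")  # hypothetical connection string
    cur = conn.cursor(name="stream_me")        # giving a name makes it server-side
    cur.itersize = 100                         # rows pulled per network round trip
    cur.execute("SELECT a, b FROM big_table")  # hypothetical table

    for row in cur:  # iterates lazily; the server holds the rest
        ...

    cur.close()
    conn.close()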
You can still use yield_per with MySQLdb, or any other DBAPI; that's the whole point of the DBAPI's design. You'll have to pay the memory cost for all the raw rows hidden away in the DBAPI (which are tuples, fairly cheap), but you won't also have to pay for all of the ORM objects at the same time.