I have a ~10M record MySQL table that I interface with using SqlAlchemy. I have found that queries on large subsets of this table will consume too much memory even though I thought I was using a built-in generator that intelligently fetched bite-sized chunks of the dataset:
for thing in session.query(Things): analyze(thing)
To avoid this, I find I have to build my own iterator that bites off in chunks:
lastThingID = None while True: things = query.filter(Thing.id < lastThingID).limit(querySize).all() if not rows or len(rows) == 0: break for thing in things: lastThingID = row.id analyze(thing)
Is this normal or is there something I'm missing regarding SA built-in generators?
The answer to this question seems to indicate that the memory consumption is not to be expected.
SQLAlchemy is very, very fast. It's just that users tend to be unaware of just how much functionality is being delivered, and confuse an ORM result set with that of a raw database cursor. They are quite different, and SQLAlchemy offers many options for controlling the mixture of "raw" vs.
SQLAlchemy is great because it provides a good connection / pooling infrastructure; a good Pythonic query building infrastructure; and then a good ORM infrastructure that is capable of complex queries and mappings (as well as some pretty stone-simple ones).
If you want to view your data in a more schema-centric view (as used in SQL), use Core. If you have data for which business objects are not needed, use Core. If you view your data as business objects, use ORM. If you are building a quick prototype, use ORM.
SQLAlchemy is the ORM of choice for working with relational databases in python. The reason why SQLAlchemy is so popular is because it is very simple to implement, helps you develop your code quicker and doesn't require knowledge of SQL to get started.
Most DBAPI implementations fully buffer rows as they are fetched - so usually, before the SQLAlchemy ORM even gets a hold of one result, the whole result set is in memory.
But then, the way Query
works is that it fully loads the given result set by default before returning to you your objects. The rationale here regards queries that are more than simple SELECT statements. For example, in joins to other tables that may return the same object identity multiple times in one result set (common with eager loading), the full set of rows needs to be in memory so that the correct results can be returned otherwise collections and such might be only partially populated.
So Query
offers an option to change this behavior through yield_per()
. This call will cause the Query
to yield rows in batches, where you give it the batch size. As the docs state, this is only appropriate if you aren't doing any kind of eager loading of collections so it's basically if you really know what you're doing. Also, if the underlying DBAPI pre-buffers rows, there will still be that memory overhead so the approach only scales slightly better than not using it.
I hardly ever use yield_per()
; instead, I use a better version of the LIMIT approach you suggest above using window functions. LIMIT and OFFSET have a huge problem that very large OFFSET values cause the query to get slower and slower, as an OFFSET of N causes it to page through N rows - it's like doing the same query fifty times instead of one, each time reading a larger and larger number of rows. With a window-function approach, I pre-fetch a set of "window" values that refer to chunks of the table I want to select. I then emit individual SELECT statements that each pull from one of those windows at a time.
The window function approach is on the wiki and I use it with great success.
Also note: not all databases support window functions; you need Postgresql, Oracle, or SQL Server. IMHO using at least Postgresql is definitely worth it - if you're using a relational database, you might as well use the best.
I am not a database expert, but when using SQLAlchemy as a simple Python abstraction layer (ie, not using the ORM Query object) I've came up with a satisfying solution to query a 300M-row table without exploding memory usage...
Here is a dummy example:
from sqlalchemy import create_engine, select conn = create_engine("DB URL...").connect() q = select([huge_table]) proxy = conn.execution_options(stream_results=True).execute(q)
Then, I use the SQLAlchemy fetchmany()
method to iterate over the results in a infinite while
loop:
while 'batch not empty': # equivalent of 'while True', but clearer batch = proxy.fetchmany(100000) # 100,000 rows at a time if not batch: break for row in batch: # Do your stuff here... proxy.close()
This method allowed me to do all kind of data aggregation without any dangerous memory overhead.
NOTE
the stream_results
works with Postgres and the pyscopg2
adapter, but I guess it won't work with any DBAPI, nor with any database driver...
There is an interesting usecase in this blog post that inspired my above method.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With