
memory-efficient built-in SQLAlchemy iterator/generator?

I have a ~10M record MySQL table that I interface with using SQLAlchemy. I have found that queries on large subsets of this table consume too much memory, even though I thought I was using a built-in generator that intelligently fetched bite-sized chunks of the dataset:

for thing in session.query(Things):
    analyze(thing)

To avoid this, I find I have to build my own iterator that bites off in chunks:

# work backwards through the table by id, querySize rows at a time
lastThingID = None
while True:
    query = session.query(Things)
    if lastThingID is not None:
        query = query.filter(Things.id < lastThingID)
    things = query.order_by(Things.id.desc()).limit(querySize).all()
    if not things:
        break
    for thing in things:
        lastThingID = thing.id
        analyze(thing)

Is this normal or is there something I'm missing regarding SA built-in generators?

The answer to this question seems to indicate that the memory consumption is not to be expected.

Asked Sep 12 '11 by Paul


2 Answers

Most DBAPI implementations fully buffer rows as they are fetched - so usually, before the SQLAlchemy ORM even gets a hold of one result, the whole result set is in memory.

But then, the way Query works is that it fully loads the given result set by default before returning your objects to you. The rationale here regards queries that are more than simple SELECT statements. For example, in joins to other tables that may return the same object identity multiple times in one result set (common with eager loading), the full set of rows needs to be in memory so that the correct results can be returned; otherwise collections and such might be only partially populated.

So Query offers an option to change this behavior through yield_per(). This call will cause the Query to yield rows in batches, where you give it the batch size. As the docs state, this is only appropriate if you aren't doing any kind of eager loading of collections, so it's really only for when you know what you're doing. Also, if the underlying DBAPI pre-buffers rows, there will still be that memory overhead, so the approach only scales slightly better than not using it.
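As a rough sketch of what that looks like, using the Things class from the question and an arbitrary batch size of 1000:

for thing in session.query(Things).yield_per(1000):
    analyze(thing)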

I hardly ever use yield_per(); instead, I use a better version of the LIMIT approach you suggest above using window functions. LIMIT and OFFSET have a huge problem: very large OFFSET values cause the query to get slower and slower, as an OFFSET of N causes it to page through N rows - it's like doing the same query fifty times instead of one, each time reading a larger and larger number of rows. With a window-function approach, I pre-fetch a set of "window" values that refer to chunks of the table I want to select. I then emit individual SELECT statements that each pull from one of those windows at a time.

The window function approach is on the wiki and I use it with great success.
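For illustration, here is a simplified sketch of the idea rather than the wiki recipe's exact code; the windowed_ranges helper name and the window size are made up for this example, and it assumes Things.id is an indexed integer primary key:

from sqlalchemy import func

def windowed_ranges(session, column, windowsize):
    # One pass with row_number() to collect every windowsize-th id
    # as a window boundary.
    subq = session.query(
        column.label("id"),
        func.row_number().over(order_by=column).label("rownum"),
    ).subquery()
    boundaries = [
        row.id
        for row in session.query(subq.c.id)
        .filter(subq.c.rownum % windowsize == 1)
    ]
    # Yield half-open (lower, upper) ranges; the last upper is None.
    return zip(boundaries, boundaries[1:] + [None])

# One small, index-friendly SELECT per window instead of one huge result set.
for lower, upper in windowed_ranges(session, Things.id, 10000):
    q = session.query(Things).filter(Things.id >= lower)
    if upper is not None:
        q = q.filter(Things.id < upper)
    for thing in q:
        analyze(thing)

Each iteration is a plain range query that the id index can serve quickly, no matter how deep into the table you are.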

Also note: not all databases support window functions; you need Postgresql, Oracle, or SQL Server. IMHO using at least Postgresql is definitely worth it - if you're using a relational database, you might as well use the best.

Answered Sep 17 '22 by zzzeek


I am not a database expert, but when using SQLAlchemy as a simple Python abstraction layer (i.e., not using the ORM Query object) I came up with a satisfying solution for querying a 300M-row table without exploding memory usage...

Here is a dummy example:

from sqlalchemy import create_engine, select

conn = create_engine("DB URL...").connect()
q = select([huge_table])

proxy = conn.execution_options(stream_results=True).execute(q)

Then, I use SQLAlchemy's fetchmany() method to iterate over the results in an infinite while loop:

while 'batch not empty':  # equivalent of 'while True', but clearer
    batch = proxy.fetchmany(100000)  # 100,000 rows at a time

    if not batch:
        break

    for row in batch:
        # Do your stuff here...
        pass

proxy.close()

This method allowed me to do all kinds of data aggregation without any dangerous memory overhead.

NOTE: stream_results works with Postgres and the psycopg2 adapter, but I guess it won't work with every DBAPI, nor with every database driver...
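I have not tried this against MySQL myself, but since the question is about a MySQL table, my understanding is that you would usually get unbuffered results there by handing the driver a server-side cursor class via connect_args; the pymysql driver and the connection URL below are assumptions:

import pymysql.cursors
from sqlalchemy import create_engine

# Assumption: pymysql driver; SSCursor asks MySQL to stream rows instead
# of buffering the entire result set on the client.
engine = create_engine(
    "mysql+pymysql://user:password@host/dbname",
    connect_args={"cursorclass": pymysql.cursors.SSCursor},
)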

There is an interesting use case in this blog post that inspired my above method.

Answered Sep 21 '22 by edthrn