How does the chunksize parameter in pandas.read_sql() avoid loading data into memory

Question

I'm iterating through the results of pd.read_sql(query, engine, chunksize=10000).

The SQLAlchemy engine is created with echo=True, so it prints the raw SQL statements that pandas sends to the database (Postgres).

The printouts show that Pandas hits the db only once with exactly the query I wrote, without any modifications. With this in mind, how is it possible for Pandas to iterate through the full output of that query in chunks, while also not storing all chunks in memory at once?

Oliver Rice · Accepted Answer

The single SQL query makes the database aware of which results it needs to return.

Actually returning the results is handled by the communication protocol, which your driver (probably psycopg2 for Python) implements.

That protocol allows result sets to be streamed. The rows can then be chunked at the driver layer, the pandas layer, or both, without executing multiple SQL statements.
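To make the "one query, many chunks" idea concrete, here is a minimal sketch of pulling rows in batches from a cursor that has already been executed. It uses the standard DB-API fetchmany() call, which is roughly how chunking can be done above the driver. The table name, the iter_chunks helper, and the use of sqlite3 as a stand-in for Postgres are assumptions made only so the example runs anywhere.

```python
import sqlite3


def iter_chunks(cursor, chunksize):
    """Yield lists of at most `chunksize` rows from an already-executed cursor."""
    while True:
        rows = cursor.fetchmany(chunksize)
        if not rows:
            break
        yield rows


# Hypothetical demo table; sqlite3 stands in for Postgres here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers (n) VALUES (?)", [(i,) for i in range(25)])

cursor = conn.cursor()
cursor.execute("SELECT n FROM numbers ORDER BY n")  # the only query that is issued

# Each iteration fetches the next batch; no further SQL is executed.
chunk_sizes = [len(rows) for rows in iter_chunks(cursor, chunksize=10)]
print(chunk_sizes)  # [10, 10, 5]

conn.close()
```

Whether the rows that have not been fetched yet sit on the server or in client memory depends on the driver. With psycopg2, an ordinary client-side cursor generally transfers the whole result set when the query executes, so to keep memory use low you would likely want a server-side cursor, for example through SQLAlchemy's stream_results=True execution option.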

Tags: python, pandas, postgresql, sqlalchemy

Matt
