I'm iterating through the results of pd.read_sql(query, engine, chunksize=10000). I'm doing this with the engine (sqlalchemy) set to echo=True so that it prints out the raw SQL commands that Pandas is hitting the db (postgres) with.
The printouts show that Pandas hits the db only once with exactly the query I wrote, without any modifications. With this in mind, how is it possible for Pandas to iterate through the full output of that query in chunks, while also not storing all chunks in memory at once?
The single SQL query tells the database which results it needs to return.
Actually returning those results is the job of the communication protocol, which your driver (probably psycopg2 for Python) implements.
That protocol allows result sets to be streamed, so the rows can be chunked at the driver layer, the pandas layer, or both, without executing multiple SQL statements.
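To make the "one query, many fetches" idea concrete, here is a minimal sketch using only the standard library. It is not what pandas does internally. It assumes an in-memory SQLite database in place of Postgres/psycopg2, and the helper name iter_chunks and the table t are made up for illustration. The statement is executed once, and the rows are then pulled from the open cursor in batches with fetchmany.

```python
import sqlite3


def iter_chunks(conn, query, chunksize):
    """Execute `query` once, then yield lists of at most `chunksize` rows.

    Hypothetical helper for illustration; not part of pandas or SQLAlchemy.
    """
    cursor = conn.cursor()
    cursor.execute(query)  # the only SQL statement issued for this query
    try:
        while True:
            rows = cursor.fetchmany(chunksize)  # pull the next batch from the open cursor
            if not rows:
                break
            yield rows
    finally:
        cursor.close()


# Assumption: SQLite stands in for Postgres so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t (n) VALUES (?)", [(i,) for i in range(25)])
conn.commit()

# Record every statement SQLite runs from here on, similar in spirit to echo=True.
statements = []
conn.set_trace_callback(statements.append)

chunk_sizes = [len(chunk) for chunk in iter_chunks(conn, "SELECT n FROM t ORDER BY n", 10)]
print(chunk_sizes)  # [10, 10, 5]
print(statements)   # the SELECT appears once, however many chunks were fetched
```

One caveat for Postgres: as far as I know, psycopg2's default cursor is client-side and pulls the whole result set into client memory when the query executes, so chunksize alone only limits the size of each DataFrame. To have rows actually streamed from the server you would typically need a server-side cursor, which SQLAlchemy can request with execution_options(stream_results=True) on the connection.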