Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does the chunksize parameter in pandas.read_sql() avoid loading data into memory

I'm iterating through the results of pd.read_sql(query, engine, chunksize=10000)

I'm doing this with engine (sqlalchemy) set to echo=True so that it prints out the raw sql commands that Pandas is hitting the db (postgres) with.

The printouts show that Pandas hits the db only once with exactly the query I wrote, without any modifications. With this in mind, how is it possible for Pandas to iterate through the full output of that query in chunks, while also not storing all chunks in memory at once?

like image 204
Matt Avatar asked Nov 07 '22 10:11

Matt


1 Answers

The single SQL query makes the database aware of which results it needs to return.

Actually returning the results is handled by the communication protocol that your driver (probably psycopg2 for python) handles.

That protocol allows for streaming result sets. Those results can then be chunked at either the driver and/or pandas layer without executing multiple SQL statements.

like image 199
Oliver Rice Avatar answered Nov 12 '22 14:11

Oliver Rice