 

How to create a large pandas dataframe from an sql query without running out of memory?

I have trouble querying a table of > 5 million records from an MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory.

This works:

import pandas.io.sql as psql

sql = "SELECT TOP 1000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)

...but this does not work:

sql = "SELECT TOP 2000000 * FROM MyTable"  data = psql.read_frame(sql, cnxn) 

It returns this error:

File "inference.pyx", line 931, in pandas.lib.to_object_array_tuples (pandas\lib.c:42733) Memory Error 

I have read here that a similar problem exists when creating a dataframe from a CSV file, and that the workaround is to use the 'iterator' and 'chunksize' parameters like this:

read_csv('exp4326.csv', iterator=True, chunksize=1000) 
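
For reference, a minimal sketch of how those chunks would be consumed, using the same file name as the snippet above (the per-chunk work here just counts rows and stands in for whatever processing is actually needed):

import pandas as pd

total_rows = 0
# read_csv with chunksize returns an iterator of DataFrames instead of
# loading the whole file at once
for chunk in pd.read_csv('exp4326.csv', iterator=True, chunksize=1000):
    total_rows += len(chunk)   # replace with the real per-chunk processing
print(total_rows)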

Is there a similar solution for querying from an SQL database? If not, what is the preferred workaround? Should I use some other method to read the records in chunks? I have read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work just to execute a SELECT * query. Surely there is a simpler approach.

asked Aug 07 '13 by slizb

People also ask

How do I make Pandas use less memory?

Change numeric columns to smaller dtypes: instead of keeping the defaults, downcast the data types, e.g. convert int64 values to int8 (where the value range allows) and float64 to float32. This reduces memory usage.
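
A minimal sketch of downcasting with pd.to_numeric (the column names and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
df["a"] = pd.to_numeric(df["a"], downcast="integer")  # int64 -> int8 for these values
df["b"] = pd.to_numeric(df["b"], downcast="float")    # float64 -> float32
print(df.dtypes)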

Is Pandas read_sql slow?

Reading SQL queries into Pandas dataframes is a common task, and one that can be very slow. Depending on the database being used, this may be hard to get around, but for those of us using Postgres we can speed this up considerably using the COPY command.
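
A minimal sketch of that COPY approach, assuming psycopg2 and a hypothetical connection string and table; COPY streams the query result as CSV text, which pandas then parses much faster than row-by-row fetching:

import io

import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical credentials
buf = io.StringIO()
with conn.cursor() as cur:
    # stream the query result into an in-memory CSV buffer
    cur.copy_expert("COPY (SELECT * FROM my_table) TO STDOUT WITH CSV HEADER", buf)
buf.seek(0)
df = pd.read_csv(buf)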

How much memory does a Pandas Dataframe use?

By default, Pandas reports only the memory used by the NumPy array it uses to store the data. For string columns this is just 8 bytes per string, since NumPy stores only 64-bit pointers to the underlying Python string objects.
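
This is easy to check with DataFrame.memory_usage; passing deep=True also counts the Python string objects behind those pointers (toy data for illustration):

import pandas as pd

df = pd.DataFrame({"s": ["a", "bb", "ccc"]})
print(df.memory_usage())           # default: only the 8-byte pointer per string
print(df.memory_usage(deep=True))  # deep=True: also the Python string objects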

How many GB can Pandas handle?

In practice the upper limit for a pandas DataFrame has been reported as around 100 GB, bounded by the free disk space on the machine: when the machine runs low on memory, the operating system pushes data that isn't currently being used into a swapfile for temporary storage.


1 Answer

As mentioned in a comment, starting from pandas 0.15 you have a chunksize option in read_sql to read and process the query result chunk by chunk:

sql = "SELECT * FROM My_Table" for chunk in pd.read_sql_query(sql , engine, chunksize=5):     print(chunk) 

Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
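
For the MS SQL Server case in the question, a hedged sketch of how the chunked read might be wired up end to end (the pyodbc connection string, column name, and filter are hypothetical placeholders; the point is to reduce each chunk before keeping it rather than materialising all rows at once):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://user:password@my_dsn")  # hypothetical DSN

pieces = []
for chunk in pd.read_sql_query("SELECT * FROM MyTable", engine, chunksize=100000):
    # filter/aggregate each chunk so only the reduced result stays in memory
    pieces.append(chunk[chunk["some_column"] > 0])  # hypothetical filter

data = pd.concat(pieces, ignore_index=True)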

answered Sep 19 '22 by Kamil Sindi