I have trouble querying a table of > 5 million records from an MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory.
This works:
    import pandas.io.sql as psql
    sql = "SELECT TOP 1000000 * FROM MyTable"
    data = psql.read_frame(sql, cnxn)
...but this does not work:
sql = "SELECT TOP 2000000 * FROM MyTable" data = psql.read_frame(sql, cnxn)
It returns this error:
File "inference.pyx", line 931, in pandas.lib.to_object_array_tuples (pandas\lib.c:42733) Memory Error
I have read here that a similar problem exists when creating a DataFrame from a CSV file, and that the work-around is to use the 'iterator' and 'chunksize' parameters, like this:
    read_csv('exp4326.csv', iterator=True, chunksize=1000)
Is there a similar solution for querying from a SQL database? If not, what is the preferred work-around? Should I use some other method to read the records in chunks? I read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work just to execute a SELECT * query. Surely there is a simpler approach.
Changing numeric columns to a smaller dtype: another way to reduce memory usage is to downcast the data types, e.g. converting int64 values to int8 (when the values fit) and float64 to float32 (NumPy has no float8).
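A minimal sketch of that idea using pd.to_numeric (the DataFrame and column names here are made up for illustration):

    import pandas as pd

    # Hypothetical DataFrame that pandas loaded with default 64-bit dtypes
    df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

    # Downcast each column to the smallest dtype that can hold its values
    df["a"] = pd.to_numeric(df["a"], downcast="integer")  # int64 -> int8 here
    df["b"] = pd.to_numeric(df["b"], downcast="float")    # float64 -> float32

    print(df.dtypes)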
Reading SQL queries into pandas DataFrames is a common task, and one that can be very slow. Depending on the database being used, this may be hard to get around, but for those of us using Postgres we can speed things up considerably using the COPY command.
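A rough sketch of that approach with psycopg2 (the connection string and table name are placeholders, and this assumes a Postgres database): stream the table out as CSV with copy_expert, then parse it with read_csv:

    import io

    import pandas as pd
    import psycopg2

    # Placeholder connection details
    conn = psycopg2.connect("dbname=mydb user=me")

    buf = io.StringIO()
    with conn.cursor() as cur:
        # COPY serializes the table server-side, which is much faster
        # than fetching rows one by one through the driver
        cur.copy_expert("COPY my_table TO STDOUT WITH CSV HEADER", buf)

    buf.seek(0)
    data = pd.read_csv(buf)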
By default, pandas reports only the memory used by the NumPy array backing each column. For string (object-dtype) columns, that is just 8 bytes multiplied by the number of strings, since NumPy stores only 64-bit pointers; pass deep=True to memory_usage to count the strings themselves.
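To see the difference, compare the default report with deep=True (toy data for illustration):

    import pandas as pd

    df = pd.DataFrame({"s": ["apple", "banana", "cherry"]})

    print(df.memory_usage())           # counts only the 8-byte pointers
    print(df.memory_usage(deep=True))  # includes the string objects themselves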
In practice the upper limit for a pandas DataFrame is available memory rather than a fixed size. When physical RAM runs low, the operating system pushes data that isn't currently being used into a swap file on disk for temporary storage, and performance degrades sharply.
As mentioned in a comment, starting from pandas 0.15 you have a chunksize option in read_sql to read and process the query chunk by chunk:
sql = "SELECT * FROM My_Table" for chunk in pd.read_sql_query(sql , engine, chunksize=5): print(chunk)
Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
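If the goal is to never hold the full result in memory, process each chunk as it arrives instead of accumulating them. A sketch under that assumption (the table name, chunk size, and per-chunk work are placeholders; engine is the same SQLAlchemy engine as above):

    import pandas as pd

    sql = "SELECT * FROM My_Table"
    total_rows = 0
    for chunk in pd.read_sql_query(sql, engine, chunksize=100000):
        # Do the per-chunk work here: filter, aggregate, or write to disk
        total_rows += len(chunk)
    print(total_rows)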