I have trouble querying a table of > 5 million records from an MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory.
This works:
    import pandas.io.sql as psql
    sql = "SELECT TOP 1000000 * FROM MyTable"
    data = psql.read_frame(sql, cnxn)
...but this does not work:
sql = "SELECT TOP 2000000 * FROM MyTable" data = psql.read_frame(sql, cnxn)
It returns this error:
File "inference.pyx", line 931, in pandas.lib.to_object_array_tuples (pandas\lib.c:42733) Memory Error
I have read here that a similar problem exists when creating a DataFrame from a CSV file, and that the work-around is to use the 'iterator' and 'chunksize' parameters, like this:
    read_csv('exp4326.csv', iterator=True, chunksize=1000)
Is there a similar solution for querying from a SQL database? If not, what is the preferred work-around? Should I use some other method to read the records in chunks? I read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work just to execute a SELECT * query. Surely there is a simpler approach.
Changing numeric columns to a smaller dtype: another way to reduce memory usage is to downcast the data types, e.g. converting int64 values to int8 (when the values fit) and float64 to float32 (NumPy has no float8).
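A minimal sketch of that idea using pd.to_numeric (the DataFrame and column names here are made up for illustration):

    import pandas as pd

    # Hypothetical DataFrame that pandas loaded with default 64-bit dtypes
    df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

    # Downcast each column to the smallest dtype that can hold its values
    df["a"] = pd.to_numeric(df["a"], downcast="integer")  # int64 -> int8 here
    df["b"] = pd.to_numeric(df["b"], downcast="float")    # float64 -> float32

    print(df.dtypes)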
Reading SQL queries into pandas DataFrames is a common task, and one that can be very slow. Depending on the database being used, this may be hard to get around, but for those of us using Postgres we can speed things up considerably using the COPY command.
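A rough sketch of that approach with psycopg2 (the connection string and table name are placeholders, and this assumes a Postgres database): stream the table out as CSV with copy_expert, then parse it with read_csv:

    import io

    import pandas as pd
    import psycopg2

    # Placeholder connection details
    conn = psycopg2.connect("dbname=mydb user=me")

    buf = io.StringIO()
    with conn.cursor() as cur:
        # COPY serializes the table server-side, which is much faster
        # than fetching rows one by one through the driver
        cur.copy_expert("COPY my_table TO STDOUT WITH CSV HEADER", buf)

    buf.seek(0)
    data = pd.read_csv(buf)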
By default, pandas reports only the memory used by the NumPy array backing each column. For string (object-dtype) columns, that is just 8 bytes multiplied by the number of strings, since NumPy stores only 64-bit pointers; pass deep=True to memory_usage to count the strings themselves.
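To see the difference, compare the default report with deep=True (toy data for illustration):

    import pandas as pd

    df = pd.DataFrame({"s": ["apple", "banana", "cherry"]})

    print(df.memory_usage())           # counts only the 8-byte pointers
    print(df.memory_usage(deep=True))  # includes the string objects themselves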
In practice the upper limit for a pandas DataFrame is available memory rather than a fixed size. When physical RAM runs low, the operating system pushes data that isn't currently being used into a swap file on disk for temporary storage, and performance degrades sharply.
As mentioned in a comment, starting from pandas 0.15 you have a chunksize option in read_sql to read and process the query chunk by chunk:
sql = "SELECT * FROM My_Table" for chunk in pd.read_sql_query(sql , engine, chunksize=5): print(chunk)
Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
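If the goal is to never hold the full result in memory, process each chunk as it arrives instead of accumulating them. A sketch under that assumption (the table name, chunk size, and per-chunk work are placeholders; engine is the same SQLAlchemy engine as above):

    import pandas as pd

    sql = "SELECT * FROM My_Table"
    total_rows = 0
    for chunk in pd.read_sql_query(sql, engine, chunksize=100000):
        # Do the per-chunk work here: filter, aggregate, or write to disk
        total_rows += len(chunk)
    print(total_rows)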