Pandas SQL chunksize

Tags:

python

sql-server

pandas

chunks

This is more a question about understanding than about programming. I am quite new to pandas and SQL. I am using pandas to read data from SQL with a specific chunksize. For example, I run a SQL query like this:

import pandas as pd
# conn: an already-open database connection
df = pd.read_sql_query('select name, birthdate from table1', conn, chunksize=1000)

What I do not understand is this: when I do not give a chunksize, the data is stored in memory and I can see the memory usage growing; when I do give a chunksize, the memory usage is not that high.

My understanding is that df now contains a number of arrays, which I can access like this:

for df_array in df:
    print(df_array.head(5))

What I do not understand here is whether the entire result of the SQL statement is kept in memory (i.e. df is an object carrying multiple arrays), or whether these are more like pointers to a temporary table created by the SQL query.

I would be very glad to develop some understanding of how this process actually works.

asked Aug 05 '15 16:08

Nitin Kumar

1 Answer

Let's consider the two options and what happens in each case:

1. chunksize is None (the default value):
- pandas passes the query to the database
- the database executes the query
- pandas checks and sees that chunksize is None
- pandas tells the database that it wants to receive all rows of the result table at once
- the database returns all rows of the result table
- pandas stores the result table in memory and wraps it in a data frame
- now you can use the data frame

2. chunksize is not None:
- pandas passes the query to the database
- the database executes the query
- pandas checks and sees that chunksize has some value
- pandas creates a query iterator (an ordinary 'while True' loop that breaks when the database reports that no data is left) and advances it each time you request the next chunk of the result table
- on each iteration, pandas tells the database that it wants to receive chunksize rows
- the database returns the next chunksize rows from the result table
- pandas stores those chunksize rows in memory and wraps them in a data frame
- now you can use that data frame
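
The 'while True' loop described in the second case can be sketched roughly as follows. This is a simplified illustration, not the actual pandas code: the helper name iter_chunks and the in-memory SQLite table are assumptions made for the example, and it uses only the standard DB-API fetchmany call.

```python
import sqlite3


def iter_chunks(cursor, chunksize):
    """Yield lists of at most `chunksize` rows until the cursor is exhausted.

    Hypothetical helper illustrating the idea; not the pandas implementation.
    """
    while True:
        rows = cursor.fetchmany(chunksize)  # ask the driver for the next batch
        if not rows:                        # empty batch: no more data left
            break
        yield rows


# Hypothetical in-memory table standing in for the asker's table1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (name TEXT, birthdate TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(f"person{i}", "2000-01-01") for i in range(2500)],
)

cursor = conn.execute("SELECT name, birthdate FROM table1")
chunk_sizes = [len(rows) for rows in iter_chunks(cursor, 1000)]
print(chunk_sizes)  # [1000, 1000, 500]
```

Because the function is a generator, only one batch of rows is held by the loop at a time; the previous batch can be garbage-collected once nothing refers to it.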

For more details, see the pandas/io/sql.py module; it is well documented.
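
To see the difference from the caller's side, here is a small sketch using an in-memory SQLite database (the table contents are made up for the example). Without chunksize the call returns a single data frame; with chunksize it returns an iterator that produces one data frame per chunk, so df in the question is not a container of arrays. Note that how much the database driver itself buffers on the client side can vary by driver, so the memory savings are not guaranteed to be identical everywhere.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table standing in for the asker's table1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (name TEXT, birthdate TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(f"person{i}", "2000-01-01") for i in range(2500)],
)

query = "SELECT name, birthdate FROM table1"

# Without chunksize: one DataFrame holding every row of the result.
full = pd.read_sql_query(query, conn)

# With chunksize: an iterator; each item is a DataFrame of up to 1000 rows.
chunks = pd.read_sql_query(query, conn, chunksize=1000)

chunk_lengths = []
for chunk in chunks:
    # Only the current chunk is referenced here; earlier ones can be freed.
    chunk_lengths.append(len(chunk))

print(type(full).__name__, len(full))
print(isinstance(chunks, pd.DataFrame), chunk_lengths)
```

Each chunk is an ordinary data frame, so calls such as chunk.head(5) work inside the loop exactly as they would on the full result.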

answered Sep 28 '22 05:09

prusya
