Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas SQL chunksize

This is more of a question on understanding than programming. I am quite new to Pandas and SQL. I am using pandas to read data from SQL with some specific chunksize. When I run a sql query e.g. import pandas as pd

df = pd.read_sql_query('select name, birthdate from table1', chunksize = 1000) 

What I do not understand is when I do not give a chunksize, data is stored in the memory and I can see the memory growing however, when I give a chunksize the memory usage is not that high.

I have is that this df now contains a number of arrays which I can access as

for df_array in df:     print df.head(5) 

What I do not understand here is if the entire result of the SQL statement is kept in memory i.e. df is an object carrying multiple arrays or if these are like pointers pointing towards a temp table created by SQL query.

I would be very glad to develop some understanding about how this process is actually working.

like image 790
Nitin Kumar Avatar asked Aug 05 '15 16:08

Nitin Kumar


People also ask

How do you use Chunksize in pandas?

Sometimes, we use the chunksize parameter while reading large datasets to divide the dataset into chunks of data. We specify the size of these chunks with the chunksize parameter. This saves computational memory and improves the efficiency of the code.

Is Panda faster than MySQL?

This main difference can mean that the two tools are separate, however, you can also perform several of the same functions in each respective tool, for example, you can create new features from existing columns in pandas, perhaps easier and faster than in SQL.

Can we use SQL in pandas DataFrame?

Pandasql can work both on Pandas DataFrame and Series . The sqldf method is used to query the Dataframes and it requires 2 inputs: The SQL query string.


Video Answer


1 Answers

Let's consider two options and what happens in both cases:

  1. chunksize is None(default value):
    • pandas passes query to database
    • database executes query
    • pandas checks and sees that chunksize is None
    • pandas tells database that it wants to receive all rows of the result table at once
    • database returns all rows of the result table
    • pandas stores the result table in memory and wraps it into a data frame
    • now you can use the data frame
  2. chunksize in not None:
    • pandas passes query to database
    • database executes query
    • pandas checks and sees that chunksize has some value
    • pandas creates a query iterator(usual 'while True' loop which breaks when database says that there is no more data left) and iterates over it each time you want the next chunk of the result table
    • pandas tells database that it wants to receive chunksize rows
    • database returns the next chunksize rows from the result table
    • pandas stores the next chunksize rows in memory and wraps it into a data frame
    • now you can use the data frame

For more details you can see pandas\io\sql.py module, it is well documented

like image 96
prusya Avatar answered Sep 28 '22 05:09

prusya