This is more a question of understanding than of programming. I am quite new to pandas and SQL. I am using pandas to read data from SQL with a specific chunksize. When I run a SQL query, e.g.
import pandas as pd
# con: an already-open database connection
df = pd.read_sql_query('select name, birthdate from table1', con, chunksize=1000)
What I observe is that when I do not give a chunksize, the data is stored in memory and I can see memory usage growing; when I do give a chunksize, memory usage is not nearly as high.
My understanding is that df now contains a number of arrays, which I can access as
for df_array in df:
    print(df_array.head(5))
What I do not understand is whether the entire result of the SQL statement is kept in memory (i.e. df is an object carrying multiple arrays), or whether these are like pointers to a temporary table created by the SQL query.
I would be very glad to develop some understanding about how this process is actually working.
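When reading large datasets, the chunksize parameter splits the result into chunks; its value is the number of rows in each chunk. With chunksize set, read_sql_query returns an iterator of DataFrames rather than a single DataFrame, so only one chunk has to exist as a DataFrame at a time, which lowers peak memory use. Whether the database driver still buffers the full result set on the client depends on the driver and its cursor settings.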
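To see the difference concretely, here is a minimal sketch using an in-memory SQLite database. The table contents and the 2500-row size are made up for illustration; only the table and column names come from the question.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table standing in for table1 from the question.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (name TEXT, birthdate TEXT)")
con.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(f"person{i}", "2000-01-01") for i in range(2500)],
)
con.commit()

query = "SELECT name, birthdate FROM table1"

# Without chunksize: a single DataFrame holding every row.
full_df = pd.read_sql_query(query, con)

# With chunksize: an iterator that yields DataFrames of at most 1000 rows.
chunks = pd.read_sql_query(query, con, chunksize=1000)
chunk_lengths = [len(chunk) for chunk in chunks]

con.close()
```

Each chunk can be processed and discarded inside the loop, so the DataFrames for earlier chunks do not have to stay alive while later ones are read.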
This main difference can make the two tools seem separate; however, many of the same operations can be performed in either one. For example, you can create new features from existing columns in pandas, perhaps more easily and quickly than in SQL.
Pandasql can work on both pandas DataFrames and Series. The sqldf function is used to query the DataFrames and it requires 2 inputs: the SQL query string.
Let's consider two options and what happens in both cases:
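A rough sketch of the two cases, chunksize omitted versus chunksize given, might look like the following. This is a simplified illustration built on the standard DB-API cursor methods fetchall and fetchmany, not the library's actual code; the helper names read_all and read_in_chunks and the sample table are hypothetical.

```python
import sqlite3

import pandas as pd


def read_all(cursor, columns):
    # Case 1 (no chunksize): pull every row into memory, then build one DataFrame.
    rows = cursor.fetchall()
    return pd.DataFrame(rows, columns=columns)


def read_in_chunks(cursor, columns, chunksize):
    # Case 2 (chunksize given): a generator that fetches and converts
    # one batch of rows at a time, stopping when the cursor is exhausted.
    while True:
        rows = cursor.fetchmany(chunksize)
        if not rows:
            break
        yield pd.DataFrame(rows, columns=columns)


# Hypothetical sample data for demonstration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (name TEXT, birthdate TEXT)")
con.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(f"person{i}", "2000-01-01") for i in range(25)],
)

columns = ["name", "birthdate"]
query = "SELECT name, birthdate FROM table1"

whole = read_all(con.execute(query), columns)
sizes = [len(part) for part in read_in_chunks(con.execute(query), columns, 10)]

con.close()
```

In the second case the object handed back is a generator, so nothing is fetched until it is iterated, and each DataFrame covers only the rows of its own batch.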
For more details, see the pandas\io\sql.py module; it is well documented.