I am trying to retrieve a large amount of data (more than 7 million rows) from a database and save it as a flat file. The data is retrieved with Python code (Python calls a stored procedure). The problem is that the process eats up so much memory that the Unix machine kills it automatically. I am using read_sql_query to read the data and to_csv to write the flat file. Is there a way to solve this, maybe by reading only a few thousand rows at a time, saving them, and moving on to the next batch? I have already tried the chunksize parameter, but it does not seem to resolve the issue.
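For illustration, this is roughly the kind of code I mean (a simplified sketch, not my exact code; the table name and connection are placeholders):

import pandas as pd

# Even with chunksize, concatenating the chunks rebuilds the full
# result set in memory before anything is written out.
chunks = pd.read_sql_query("SELECT * FROM table_name", dbcon, chunksize=10000)
df = pd.concat(chunks)   # all 7M+ rows end up in memory here
df.to_csv("out.csv", index=False)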
Any help or suggestion will be greatly appreciated.
When you use chunksize in read_sql_query, you can iterate over the result to avoid loading everything into memory at once. However, you also have to write out to the CSV file in chunks to make sure you aren't just copying the results of the query into a new, gigantic DataFrame chunk by chunk. Be careful to only write the column headers once. Here is an example using pandas:
import pandas as pd

dbcon = ...  # whatever

with open("out.csv", "w") as fh:
    chunks = pd.read_sql_query("SELECT * FROM table_name", dbcon, chunksize=10000)
    # Write the first chunk with the column names, but ignore the index
    # (which will be screwed up anyway due to the chunking).
    next(chunks).to_csv(fh, index=False)
    # Skip the column names from now on.
    for chunk in chunks:
        chunk.to_csv(fh, index=False, header=False)
You don't have to ignore the index when writing the CSV if you explicitly set index_col in the call to read_sql_query.
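For example, something like this sketch would keep a meaningful index, assuming the table has a primary-key column called id (the column name here is just an illustration):

import pandas as pd

dbcon = ...  # same connection as above

with open("out_with_index.csv", "w") as fh:
    chunks = pd.read_sql_query(
        "SELECT * FROM table_name",
        dbcon,
        index_col="id",   # use the id column as the DataFrame index
        chunksize=10000,
    )
    next(chunks).to_csv(fh)  # keep the index; it now comes from the id column
    for chunk in chunks:
        chunk.to_csv(fh, header=False)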