I'm working with a large pandas DataFrame that needs to be dumped into a PostgreSQL table. From what I've read it's not a good idea to dump it all at once (and I was locking up the db); rather, use the chunksize parameter. The answers here are helpful for workflow, but I'm asking specifically about how the value of chunksize affects performance.
In [5]: df.shape
Out[5]: (24594591, 4)

In [6]: df.to_sql('existing_table', con=engine, index=False, if_exists='append', chunksize=10000)
Is there a recommended default and is there a difference in performance when setting the parameter higher or lower? Assuming I have the memory to support a larger chunksize, will it execute faster?
to_sql seems to send an INSERT query for every row, which makes it really slow. But since pandas 0.24.0 there is a method parameter in DataFrame.to_sql() where you can define your own insertion function, or just use method='multi' to tell pandas to pass multiple rows in a single INSERT query, which makes it a lot faster.
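For example, a minimal sketch of the method='multi' call (the connection URL, DataFrame, and table name below are placeholders, not from the question):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/mydb')  # placeholder URL
df = pd.DataFrame({'a': range(100000), 'b': range(100000)})  # placeholder data

# method='multi' batches many rows into each INSERT statement;
# chunksize caps how many rows go into one statement.
df.to_sql('my_table', con=engine, index=False, if_exists='append',
          chunksize=1000, method='multi')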
Sometimes we use the chunksize parameter while reading large datasets to divide the dataset into chunks of data. We specify the size of these chunks with the chunksize parameter. This saves memory and improves the efficiency of the code.
Technically, chunksize is the number of rows pandas reads from a file at a time. For example, if the chunksize is 100, pandas will load the first 100 rows, then the next 100, and so on.
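A minimal sketch of that reading pattern (the file name big_file.csv is a placeholder):

import pandas as pd

total_rows = 0
# chunksize=100 makes read_csv return an iterator of DataFrames,
# each holding at most 100 rows, so the whole file never sits in memory at once.
for chunk in pd.read_csv('big_file.csv', chunksize=100):
    total_rows += len(chunk)

print(total_rows)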
In my case, 3M rows with 5 columns were inserted in 8 minutes when I used the pandas to_sql parameters chunksize=5000 and method='multi'. This was a huge improvement, as inserting 3M rows into the database from Python had been very hard for me.
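Since the method parameter also accepts a callable (as mentioned above), here is a rough sketch of a COPY-based insertion function for PostgreSQL, adapted from the pattern shown in the pandas documentation; the name psql_insert_copy is illustrative, and the code assumes the underlying driver is psycopg2:

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # Signature expected by DataFrame.to_sql(method=...):
    # table is a pandas SQLTable, conn a SQLAlchemy connection,
    # keys the column names, data_iter an iterable of row tuples.
    dbapi_conn = conn.connection  # raw psycopg2 connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

# Usage (placeholder table/engine):
# df.to_sql('existing_table', con=engine, index=False, if_exists='append',
#           chunksize=5000, method=psql_insert_copy)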
I tried something the other way around: from SQL to CSV. I noticed that the smaller the chunksize, the quicker the job was done. Adding additional CPUs to the job (multiprocessing) didn't change anything.
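A rough sketch of that SQL-to-CSV direction, assuming placeholder query, connection URL, and output file names (not taken from the answer above):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/mydb')  # placeholder URL

# Stream the result set in chunks and append each chunk to the CSV,
# writing the header only for the first chunk.
first = True
for chunk in pd.read_sql('SELECT * FROM existing_table', engine, chunksize=10000):
    chunk.to_csv('export.csv', mode='w' if first else 'a', header=first, index=False)
    first = False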