 

Optimal chunksize parameter in pandas.DataFrame.to_sql


I'm working with a large pandas DataFrame that needs to be dumped into a PostgreSQL table. From what I've read, it's not a good idea to dump it all at once (and it was locking up the db for me); instead, use the chunksize parameter. The answers here are helpful for workflow, but I'm asking specifically about how the value of chunksize affects performance.

In [5]: df.shape
Out[5]: (24594591, 4)

In [6]: df.to_sql('existing_table',
                  con=engine,
                  index=False,
                  if_exists='append',
                  chunksize=10000)

Is there a recommended default, and does setting the parameter higher or lower make a difference in performance? Assuming I have the memory to support a larger chunksize, will it execute faster?

asked Feb 04 '16 by Kevin

People also ask

Is pandas To_sql fast?

to_sql seems to send an INSERT query for every row, which makes it really slow. But since 0.24.0 there is a method parameter in pandas.DataFrame.to_sql() where you can define your own insertion function, or just use method='multi' to tell pandas to pass multiple rows in a single INSERT query, which makes it a lot faster.
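A minimal sketch of what that looks like (this assumes a SQLAlchemy engine; the connection string and table name below are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder database URL; substitute your own.
engine = create_engine('postgresql://user:password@localhost:5432/mydb')

df = pd.DataFrame({'a': range(1000), 'b': range(1000)})

# method='multi' packs many rows into each INSERT statement
# instead of issuing one INSERT per row.
df.to_sql('my_table', con=engine, index=False,
          if_exists='append', method='multi')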

How do you use Chunksize pandas?

When reading large datasets, the chunksize parameter divides the dataset into chunks of data; you specify the number of rows per chunk. This saves memory and can improve the efficiency of the code.

What is Chunksize in Python?

Technically, chunksize refers to the number of rows pandas reads from a file at a time. If the chunksize is 100, pandas will load the first 100 rows, then the next 100, and so on.
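For illustration, a short sketch of chunked reading (the file name is hypothetical):

import pandas as pd

total_rows = 0
# read_csv with chunksize returns an iterator of DataFrames,
# each holding at most 100 rows, instead of one big DataFrame.
for chunk in pd.read_csv('large_file.csv', chunksize=100):
    total_rows += len(chunk)
print(total_rows)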


2 Answers

In my case, 3M rows with 5 columns were inserted in 8 minutes when I called to_sql with chunksize=5000 and method='multi'. This was a huge improvement, as inserting 3M rows into the database from Python had been very hard for me before.
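A sketch of that combination (the connection string, source file, and table name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/mydb')  # placeholder URL
df = pd.read_csv('big_input.csv')  # hypothetical large source

# Insert in batches of 5000 rows, with many rows per INSERT statement.
df.to_sql('target_table', con=engine, index=False,
          if_exists='append', chunksize=5000, method='multi')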

answered Oct 19 '22 by Hardik Ojha


I tried it the other way around, from SQL to CSV, and noticed that the smaller the chunksize, the quicker the job was done. Adding additional CPUs to the job (multiprocessing) didn't change anything.
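A rough sketch of that direction, assuming a SQLAlchemy engine (the query, connection string, and file name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/mydb')  # placeholder

# Stream the result set in chunks and append each to a CSV,
# writing the header only with the first chunk.
first = True
for chunk in pd.read_sql('SELECT * FROM big_table', engine, chunksize=1000):
    chunk.to_csv('export.csv', mode='w' if first else 'a',
                 header=first, index=False)
    first = False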

answered Oct 19 '22 by Mohamed Amin Chairi