 

Speeding up data insertion from a pandas DataFrame to MySQL

I need to insert a 60000x24 DataFrame into a MySQL database (MariaDB) using SQLAlchemy and Python. Both the database and the insertion code run locally. So far I have been using the LOAD DATA INFILE SQL statement, but this requires the DataFrame to be dumped into a CSV file first, which takes about 1.5-2 seconds. The problem is that I have to insert 40 or more of these DataFrames, so the time is critical.

If I use df.to_sql instead, the problem gets much worse: the insertion takes at least 7 (and up to 30) seconds per DataFrame.

The code I'm using is shown below:

sql_query ="CREATE TABLE IF NOT EXISTS table(A FLOAT, B FLOAT, C FLOAT)"# 24 columns of type float
cursor.execute(sql_query)
data.to_sql("table", con=connection, if_exists="replace", chunksize=1000)

This takes between 7 and 30 seconds to execute. Using LOAD DATA, the code looks like this:

sql_query = "CREATE TABLE IF NOT EXISTS table(A FLOAT, B FLOAT, C FLOAT)"# 24 columns of type float
cursor.execute(sql_query)
data.to_csv("/tmp/data.csv")
sql_query = "LOAD DATA LOW_PRIORITY INFILE '/tmp/data.csv' REPLACE INTO TABLE 'table' FIELDS TERMINATED BY ','; "
cursor.execute(sql_query)

This takes 1.5 to 2 seconds, mainly spent dumping the DataFrame to CSV. I could improve the LOAD DATA step a bit by using LOCK TABLES, but then no data ends up in the database. So, my question is: is there any way to speed this process up, either by tweaking LOAD DATA or to_sql?
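
One detail worth checking in the LOAD DATA path (an observation added here, not from the original post): to_csv writes the index and a header line by default, and LOAD DATA will ingest both as extra data unless told otherwise. Dumping without them is slightly faster and lines up with the FIELDS clause directly:

# Leaner CSV dump: no index column, no header line for LOAD DATA to misread.
data.to_csv("/tmp/data.csv", index=False, header=False)

sql_query = (
    "LOAD DATA LOW_PRIORITY INFILE '/tmp/data.csv' "
    "REPLACE INTO TABLE `table` "
    "FIELDS TERMINATED BY ','"
)
cursor.execute(sql_query)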

UPDATE: By using the alternative function for dumping DataFrames to CSV files given in this answer to What is the fastest way to output large DataFrame into a CSV file?, I'm able to improve performance a little, but not significantly.

Charlie asked Aug 08 '19 10:08



1 Answer

If you know the data format (I assume all floats here), you can use numpy.savetxt() to drastically reduce the time needed to create the CSV:

%timeit df.to_csv(csv_fname)
2.22 s ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  

from numpy import savetxt
%timeit savetxt(csv_fname, df.values, fmt='%f', header=','.join(df.columns), delimiter=',')
714 ms ± 37.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
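
One caveat with savetxt (an addition, not part of the original answer): by default it prefixes the header with '# ' via its comments argument, which LOAD DATA would then read as a data line. Passing comments='' keeps the header a plain CSV row:

from numpy import savetxt

savetxt(
    csv_fname,
    df.values,                    # raw float block, no index column
    fmt='%f',
    delimiter=',',
    header=','.join(df.columns),
    comments='',                  # suppress the default '# ' header prefix
)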

Note that you may need to prepend

df = df.reset_index()

so that rows keep a unique key column, matching the formatting that .to_csv() produces.
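
Hypothetical usage combining the two steps (the reset index becomes an ordinary first column, so it is written out like any other field):

df = df.reset_index()
savetxt(csv_fname, df.values, fmt='%f',
        header=','.join(df.columns), delimiter=',')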

igrinis answered Oct 17 '22 01:10