Pandas to_sql() performance - why is it so slow?

I am running into performance issues with Pandas when writing DataFrames to an SQL database. To be as fast as possible I use memSQL (it speaks the MySQL wire protocol, so I don't have to change any code). I benchmarked my instance just now:

docker run --rm -it --link=memsql:memsql memsql/quickstart simple-benchmark
Creating database simple_benchmark
Warming up workload
Launching 10 workers
Workload will take approximately 30 seconds.
Stopping workload
42985000 rows inserted using 10 threads
1432833.3 rows per second

That isn't glorious, and it's just my local laptop. I know... I am also using the root user, but it's a throw-away Docker container.

Here is the code which writes my DataFrame to the DB:

    import mysql.connector  # driver used by the mysql+mysqlconnector dialect
    from sqlalchemy import create_engine
    from pandas.util.testing import test_parallel

    engine = create_engine('mysql+mysqlconnector://root@localhost:3306/netflow_test', echo=False)
    # max_allowed_packet = 1000M in mysql.conf
    # no effect

    # @test_parallel(num_threads=8)
    def commit_flows(netflow_df2):
        # %time is IPython line magic; run this in an IPython/Jupyter session
        %time netflow_df2.to_sql(name='netflow_ids', con=engine, if_exists='append', index=False, chunksize=500)
    commit_flows(netflow_df2)

Below is the %time measurement of the function.

Multi-threading does not make this faster; it stays within 7,000-8,000 rows/s.

CPU times: user 2min 6s, sys: 1.69 s, total: 2min 8s
Wall time: 2min 18s

(Screenshot: the memSQL dashboard showing the insert speed)

I also increased the max_allowed_packet size to commit in bulk, with a larger chunk size. Still not faster.
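
To rule out a silently ignored setting, it can help to ask the server which value it is actually applying. A minimal sketch, assuming the same mysql.connector driver the engine above relies on:

    import mysql.connector

    # Check the packet limit the server actually applies
    conn = mysql.connector.connect(host='localhost', user='root',
                                   database='netflow_test')
    cur = conn.cursor()
    cur.execute("SHOW VARIABLES LIKE 'max_allowed_packet'")
    print(cur.fetchone())  # e.g. ('max_allowed_packet', '1048576000')
    conn.close()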

Here is the shape of the DataFrame:

netflow_df2.shape
(1015391, 20)
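
As a sanity check: at 7,000-8,000 rows/s, those 1,015,391 rows take roughly 127-145 seconds, which lines up with the 2min 18s wall time above.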

Does anyone know how I can make this faster?

asked Oct 03 '16 by wishi

1 Answer

In case someone runs into a similar situation:

I removed SQLAlchemy and used the (deprecated) MySQL flavor for Pandas' to_sql() function. The speedup is more than 3x (wall time went from 2min 18s down to 38.2 s). I don't recommend using this, but it works for me at the moment.

    import MySQLdb
    from pandas.util.testing import test_parallel

    # raw DB-API connection (args: host, user, passwd, db) instead of a
    # SQLAlchemy engine
    engine = MySQLdb.connect("127.0.0.1", "root", "", "netflow_test")
    # engine = create_engine('mysql+mysqlconnector://root@localhost:3306/netflow_test', echo=False)

    # @test_parallel(num_threads=8)
    def commit_flows(netflow_df2):
        # %time is IPython line magic; flavor='mysql' is deprecated in pandas 0.19
        %time netflow_df2.to_sql(name='netflow_ids', flavor='mysql', con=engine, if_exists='append', index=False, chunksize=50000)
    commit_flows(netflow_df2)

If I find out how to convince memSQL to accept a large query (similar to MySQL's max_allowed_packet = 1000M in mysql.conf), this will be even faster. I should be able to hit more than 50,000 rows per second here.
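
One avenue worth trying, assuming memSQL honors MySQL's SET GLOBAL syntax for this variable (it may instead require a server-side config change), is raising the limit at runtime over the existing connection:

    # Assumption: the server accepts MySQL's SET GLOBAL for this variable;
    # 'engine' is the raw MySQLdb connection from the snippet above.
    # In MySQL this takes effect for new client connections.
    cur = engine.cursor()
    cur.execute("SET GLOBAL max_allowed_packet = 1073741824")  # 1 GiB
    cur.close()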

CPU times: user 28.7 s, sys: 797 ms, total: 29.5 s
Wall time: 38.2 s

126 s of CPU time before, 28.7 s now; wall time dropped from 2min 18s to 38.2 s.
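
That works out to roughly 26,600 rows/s for the 1,015,391-row DataFrame, versus about 7,400 rows/s with the original setup.

A forward-looking note: in pandas 0.24+ the flavor argument is gone entirely, but to_sql() gained a method parameter, and method='multi' packs each chunk into a single multi-row INSERT through SQLAlchemy. A minimal sketch, where sqlalchemy_engine stands for the create_engine(...) object from the question:

    # Sketch for pandas >= 0.24: 'con' must be a SQLAlchemy connectable here.
    # method='multi' emits one multi-row INSERT per chunk instead of one
    # statement per row.
    netflow_df2.to_sql(name='netflow_ids', con=sqlalchemy_engine,
                       if_exists='append', index=False,
                       chunksize=5000, method='multi')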

answered Sep 20 '22 by wishi