I'm deploying an application that consumes some .csv data, and I want to copy it into a MySQL table. With some help from the Stack Overflow users I wrote the code below:
import csv
import MySQLdb

db = MySQLdb.connect(host="dbname.description.host.com",
                     user="user",
                     passwd="key",
                     db="dbname")
cursor = db.cursor()
query = 'INSERT INTO table_name(column,column_1,column_2,column_3) VALUES(%s, %s, %s, %s)'
csv_data = csv.reader(open('file_name'))
for row in csv_data:
    cursor.execute(query, row)
    db.commit()
cursor.close()
The problem is that this process is very slow, and I need to speed things up.
The INSERT statement in MySQL also supports the use of VALUES syntax to insert multiple rows as a bulk insert statement. To do this, include multiple lists of column values, each enclosed within parentheses and separated by commas.
To optimize insert speed, combine many small operations into a single large operation. Ideally, you make a single connection, send the data for many new rows at once, and delay all index updates and consistency checking until the very end.
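For the table from the question, a single statement carrying three rows might look like the minimal sketch below (it reuses the db and cursor objects from the question; the literal values are placeholders):

sql = ("INSERT INTO `table_name` (`column`, `column_1`, `column_2`, `column_3`) "
       "VALUES (%s, %s, %s, %s), (%s, %s, %s, %s), (%s, %s, %s, %s)")
# one round trip for three rows, one commit at the end
cursor.execute(sql, ('1', '2', '3', '4',
                     '4', '5', '6', '7',
                     '7', '8', '9', '10'))
db.commit()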
You can use executemany to batch the job, as follows:
import csv
import MySQLdb

db = MySQLdb.connect(host="dbname.description.host.com",
                     user="user",
                     passwd="key",
                     db="dbname")
cursor = db.cursor()
query = 'INSERT INTO table_name(column,column_1,column_2,column_3) VALUES(%s, %s, %s, %s)'
csv_data = csv.reader(open('file_name'))

my_data = []
for row in csv_data:
    my_data.append(tuple(row))

cursor.executemany(query, my_data)
db.commit()
cursor.close()
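If the CSV file is very large, building one list with every row can exhaust memory or exceed the server's max_allowed_packet. A minimal variation of the snippet above (reusing its db, cursor and query, with an arbitrary chunk size of 5000 rows) is to flush the batch periodically:

batch = []
with open('file_name') as fh:
    for row in csv.reader(fh):
        batch.append(tuple(row))
        if len(batch) >= 5000:  # flush every 5000 rows (arbitrary chunk size)
            cursor.executemany(query, batch)
            batch = []
    if batch:  # insert whatever is left over
        cursor.executemany(query, batch)
db.commit()
cursor.close()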
The solution is to use a batch (multi-row) insert in MySQL.
You need to take all the values you want to insert and transform them into a single string that is used as the parameter of the execute() method.
In the end your SQL should look like this:
INSERT INTO table_name (`column`, `column_1`, `column_2`, `column_3`) VALUES('1','2','3','4'),('4','5','6','7'),('7','8','9','10');
Here is an example:
# function to transform one row (a list of four values) into a "(...)" value-tuple string
def stringify(v):
    return "('%s', '%s', %s, %s)" % (v[0], v[1], v[2], v[3])

# transform all rows to strings
v = map(stringify, csv_data)

# glue them together
batchData = ", ".join(v)

# complete the SQL
sql = "INSERT INTO `table_name`(`column`, `column_1`, `column_2`, `column_3`) \
VALUES %s" % batchData

# execute it
cursor.execute(sql)
db.commit()
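One caveat: interpolating raw values into the SQL string will break if a value contains a quote character, and it is open to SQL injection. A minimal sketch that keeps the single multi-row statement but lets the driver handle the quoting (reusing csv, cursor and db from the question) could look like this:

# load the CSV rows into a list
rows = list(csv.reader(open('file_name')))

# build "(%s, %s, %s, %s), (%s, %s, %s, %s), ..." with one group per row
placeholders = ", ".join(["(%s, %s, %s, %s)"] * len(rows))
sql = ("INSERT INTO `table_name`(`column`, `column_1`, `column_2`, `column_3`) "
       "VALUES " + placeholders)

# flatten the rows into one parameter list matching the placeholders
params = [value for row in rows for value in row]
cursor.execute(sql, params)
db.commit()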
The code you are using is very inefficient for a number of reasons: you are committing your data one row at a time, which is what you would want for a transactional process, but not for a one-off dump.
There are a number of ways to speed this up, ranging from great to not so great. Here are four approaches, including the naive implementation (above).
#!/usr/bin/env python
import pandas as pd
import numpy as np
import odo
import profilehooks
import sqlalchemy
import csv
import os


def create_test_data():
    n = 100000
    df = pd.DataFrame(dict(
        id=np.random.randint(0, 1000000, n),
        col1=np.random.choice(['hello', 'world', 'python', 'large string for testing ' * 10], n),
        col2=np.random.randint(-1000000, 1000000, n),
        col3=np.random.randint(-9000000, 9000000, n),
        col4=(np.random.random(n) - 0.5) * 99999
    ), columns=['id', 'col1', 'col2', 'col3', 'col4'])
    df.to_csv('tmp.csv', index=False)


@profilehooks.timecall
def using_pandas(table_name, uri):
    df = pd.read_csv('tmp.csv')
    df.to_sql(table_name, con=uri, if_exists='append', index=False)


@profilehooks.timecall
def using_odo(table_name, uri):
    odo.odo('tmp.csv', '%s::%s' % (uri, table_name))


@profilehooks.timecall
def using_cursor(table_name, uri):
    engine = sqlalchemy.create_engine(uri)
    query = 'INSERT INTO {} (id, col1, col2, col3, col4) VALUES(%s, %s, %s, %s, %s)'
    query = query.format(table_name)
    con = engine.raw_connection()
    with con.cursor() as cursor:
        with open('tmp.csv') as fh:
            reader = csv.reader(fh)
            next(reader)  # Skip first line (headers)
            for row in reader:
                cursor.execute(query, row)
    con.commit()
    con.close()


@profilehooks.timecall
def using_cursor_correct(table_name, uri):
    engine = sqlalchemy.create_engine(uri)
    query = 'INSERT INTO {} (id, col1, col2, col3, col4) VALUES(%s, %s, %s, %s, %s)'
    query = query.format(table_name)
    with open('tmp.csv') as fh:
        reader = csv.reader(fh)
        next(reader)  # Skip first line (headers)
        data = list(reader)
    engine.execute(query, data)


def main():
    uri = 'mysql+pymysql://root:%s@localhost/test' % os.environ['pass']
    engine = sqlalchemy.create_engine(uri)
    for i in (1, 2, 3, 4):
        engine.execute("DROP TABLE IF EXISTS table%s" % i)
        engine.execute("""
            CREATE TABLE table%s(
                id INT,
                col1 VARCHAR(255),
                col2 INT,
                col3 INT,
                col4 DOUBLE
            );
        """ % i)
    create_test_data()
    using_odo('table1', uri)
    using_pandas('table4', uri)
    using_cursor_correct('table3', uri)
    using_cursor('table2', uri)
    for i in (1, 2, 3, 4):
        count = pd.read_sql('SELECT COUNT(*) as c FROM table%s' % i, con=uri)['c'][0]
        print("Count for table%s - %s" % (i, count))


if __name__ == '__main__':
    main()
The odo method is the fastest (it uses MySQL's LOAD DATA INFILE under the hood). Next is pandas (its critical code paths are optimized). Next is using a raw cursor but inserting rows in bulk. Last is the naive method, committing one row at a time.
Here are some example timings from running locally against a local MySQL server:
using_odo (./test.py:29): 0.516 seconds
using_pandas (./test.py:23): 3.039 seconds
using_cursor_correct (./test.py:50): 12.847 seconds
using_cursor (./test.py:34): 43.470 seconds
Count for table1 - 100000
Count for table2 - 100000
Count for table3 - 100000
Count for table4 - 100000
As you can see, the naive implementation is ~100 times slower than odo and ~10 times slower than using pandas.
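Since odo's speed comes from LOAD DATA INFILE, you can also issue that statement yourself. Here is a minimal sketch against the benchmark's tmp.csv and table1, assuming the MySQL server and client both allow LOCAL INFILE (the password is a placeholder):

import MySQLdb

db = MySQLdb.connect(host="localhost", user="root", passwd="your_password",
                     db="test", local_infile=1)  # LOCAL INFILE must be enabled
cursor = db.cursor()
cursor.execute("""
    LOAD DATA LOCAL INFILE 'tmp.csv'
    INTO TABLE table1
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    IGNORE 1 LINES
    (id, col1, col2, col3, col4)
""")
db.commit()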
Here are some stats to support the answer from @Mike Tung: executemany outperforms execute. It was hard to reach 315 inserts in one second with execute, while with executemany I achieved 25,000 inserts.
Base machine configuration -
2.7 GHz Dual-Core Intel Core i5
16 GB 1867 MHz DDR3
Flash Storage
Results:
cursor.execute: 250 Inserts to max 315 Inserts in one second
cursor.executemany: 25,000 Inserts in one second
Take the commit out of the for loop:

for row in csv_data:
    cursor.execute(query, row)
db.commit()

It will do less work and will be faster.