
Speeding (Bulk) Insert into MySQL with Python

Tags:

python

mysql

csv

I'm developing an application that consumes some .csv data and copies it into a MySQL table. With some help from Stack Overflow users I wrote the code below:

import csv
import MySQLdb

db = MySQLdb.connect(   host = "dbname.description.host.com",
                        user = "user",
                        passwd = "key",
                        db = "dbname")
cursor = db.cursor()

query = 'INSERT INTO table_name(column, column_1, column_2, column_3) ' \
        'VALUES(%s, %s, %s, %s)'

csv_data = csv.reader(open('file_name'))

for row in csv_data:
     cursor.execute(query,row)
     db.commit()

cursor.close()

The problem is that the process is currently very slow, and I need to speed things up.

Pedro Quadros asked Oct 03 '17


People also ask

Does MySQL support bulk insert?

The INSERT statement in MySQL also supports the use of VALUES syntax to insert multiple rows as a bulk insert statement. To do this, include multiple lists of column values, each enclosed within parentheses and separated by commas.

How can I increase insert query performance in MySQL?

To optimize insert speed, combine many small operations into a single large operation. Ideally, you make a single connection, send the data for many new rows at once, and delay all index updates and consistency checking until the very end.
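
For illustration only (the host, credentials, table and column names below are placeholders, not from this page), that advice might look roughly like this with MySQLdb: one connection, one multi-row batch, and the consistency checks and commit deferred to the end.

import MySQLdb

# Rough sketch of the advice above, not taken from the answers below.
db = MySQLdb.connect(host="host", user="user", passwd="key", db="dbname")
cursor = db.cursor()

cursor.execute("SET autocommit = 0")          # one transaction for the whole load
cursor.execute("SET unique_checks = 0")       # delay consistency checking ...
cursor.execute("SET foreign_key_checks = 0")  # ... until the end

rows = [(1, 'a'), (2, 'b'), (3, 'c')]         # placeholder data
cursor.executemany(
    "INSERT INTO table_name (col_a, col_b) VALUES (%s, %s)", rows)

cursor.execute("SET unique_checks = 1")
cursor.execute("SET foreign_key_checks = 1")
db.commit()                                   # single commit at the very end
cursor.close()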

How do you add user input to a database in Python?

import sqlite3

conn = sqlite3.connect('mydatabase.db')
cursor = conn.cursor()
# create the salesman table
cursor.execute("CREATE TABLE salesman(salesman_id n(5), name char(30), city char(35), commission decimal(7,2));")
s_id = input('Salesman ID:')
s_name = input('Name:')
s_city = input('City:')
s_commision = input('Commission:')
cursor.


5 Answers

You can use executemany to batch the job, as follows:

import csv
import MySQLdb

db = MySQLdb.connect(   host = "dbname.description.host.com",
                        user = "user",
                        passwd = "key",
                        db = "dbname")
cursor = db.cursor()

query = 'INSERT INTO table_name(column, column_1, column_2, column_3) ' \
        'VALUES(%s, %s, %s, %s)'

csv_data = csv.reader(open('file_name'))

my_data = []
for row in csv_data:
     my_data.append(tuple(row))

cursor.executemany(query, my_data)
db.commit()   # commit once, after all the rows have been sent
cursor.close()
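
If the CSV is too large to hold comfortably in memory, a possible variation of the reading loop above (a sketch, not part of the original answer, reusing the same query, db and cursor) is to send the rows in chunks:

chunk = []
for row in csv_data:
    chunk.append(tuple(row))
    if len(chunk) >= 5000:              # arbitrary chunk size
        cursor.executemany(query, chunk)
        db.commit()
        chunk = []
if chunk:                               # flush whatever is left
    cursor.executemany(query, chunk)
    db.commit()
cursor.close()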
Mike Tung answered Oct 13 '22


The solution is to use MySQL's batch (multi-row) insert.

So you need to take all the values you want to insert and turn them into a single string that is passed to the execute() method.

In the end your SQL should look like:

INSERT INTO table_name (`column`, `column_1`, `column_2`, `column_3`) VALUES('1','2','3','4'),('4','5','6','7'),('7','8','9','10');

Here is an example:

# function to transform one row into a "(...)" values group
def stringify(v):
    return "('%s', '%s', '%s', '%s')" % (v[0], v[1], v[2], v[3])

# transform every row read from the CSV into a string
# (note: values are interpolated directly, so this assumes the CSV data is trusted)
values = map(stringify, csv_data)

# glue them together
batchData = ", ".join(values)

# complete the SQL
sql = "INSERT INTO `table_name`(`column`, `column_1`, `column_2`, `column_3`) \
VALUES %s" % batchData

# execute it
cursor.execute(sql)
db.commit()
Alex answered Oct 13 '22


The code you are using is very inefficient for a number of reasons, the main one being that you commit your data one row at a time (which is what you would want for a transactional DB or process, but not for a one-off dump).

There are a number of ways to speed this up, ranging from great to not so great. Here are four approaches, including the naive implementation (above):

#!/usr/bin/env python
import pandas as pd
import numpy as np
import odo
import profilehooks
import sqlalchemy
import csv
import os


def create_test_data():
    n = 100000
    df = pd.DataFrame(dict(
        id=np.random.randint(0, 1000000, n),
        col1=np.random.choice(['hello', 'world', 'python', 'large string for testing ' * 10], n),
        col2=np.random.randint(-1000000, 1000000, n),
        col3=np.random.randint(-9000000, 9000000, n),
        col4=(np.random.random(n) - 0.5) * 99999
    ), columns=['id', 'col1', 'col2', 'col3', 'col4'])
    df.to_csv('tmp.csv', index=False)


@profilehooks.timecall
def using_pandas(table_name, uri):
    df = pd.read_csv('tmp.csv')
    df.to_sql(table_name, con=uri, if_exists='append', index=False)


@profilehooks.timecall
def using_odo(table_name, uri):
    odo.odo('tmp.csv', '%s::%s' % (uri, table_name))


@profilehooks.timecall
def using_cursor(table_name, uri):
    engine = sqlalchemy.create_engine(uri)
    query = 'INSERT INTO {} (id, col1, col2, col3, col4) VALUES(%s, %s, %s, %s, %s)'
    query = query.format(table_name)
    con = engine.raw_connection()
    with con.cursor() as cursor:
        with open('tmp.csv') as fh:
            reader = csv.reader(fh)
            next(reader)  # Skip first line (headers)
            for row in reader:
                cursor.execute(query, row)
                con.commit()
    con.close()


@profilehooks.timecall
def using_cursor_correct(table_name, uri):
    engine = sqlalchemy.create_engine(uri)
    query = 'INSERT INTO {} (id, col1, col2, col3, col4) VALUES(%s, %s, %s, %s, %s)'
    query = query.format(table_name)
    with open('tmp.csv') as fh:
        reader = csv.reader(fh)
        next(reader)  # Skip first line (headers)
        data = list(reader)
    engine.execute(query, data)


def main():
    uri = 'mysql+pymysql://root:%s@localhost/test' % os.environ['pass']

    engine = sqlalchemy.create_engine(uri)
    for i in (1, 2, 3, 4):
        engine.execute("DROP TABLE IF EXISTS table%s" % i)
        engine.execute("""
            CREATE TABLE table%s(
                id INT,
                col1 VARCHAR(255),
                col2 INT,
                col3 INT,
                col4 DOUBLE
            );
        """ % i)
    create_test_data()

    using_odo('table1', uri)
    using_pandas('table4', uri)
    using_cursor_correct('table3', uri)
    using_cursor('table2', uri)

    for i in (1, 2, 3, 4):
        count = pd.read_sql('SELECT COUNT(*) as c FROM table%s' % i, con=uri)['c'][0]
        print("Count for table%s - %s" % (i, count))


if __name__ == '__main__':
    main()

The odo method is the fastest (it uses MySQL's LOAD DATA INFILE under the hood). Next is pandas (its critical code paths are optimized). Next is using a raw cursor but inserting the rows in bulk. Last is the naive method, committing one row at a time.

Here are some example timings, run locally against a local MySQL server.

using_odo (./test.py:29): 0.516 seconds

using_pandas (./test.py:23): 3.039 seconds

using_cursor_correct (./test.py:50): 12.847 seconds

using_cursor (./test.py:34): 43.470 seconds

Count for table1 - 100000

Count for table2 - 100000

Count for table3 - 100000

Count for table4 - 100000

As you can see, the naive implementation is ~100 times slower than odo and ~10 times slower than pandas.
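
Since odo's speed comes from LOAD DATA INFILE, you could also issue that statement yourself. Below is a minimal sketch, not one of the benchmarked functions above, assuming local_infile is enabled on the server and in the driver:

@profilehooks.timecall
def using_load_data(table_name, uri):
    # Sketch only: let MySQL read tmp.csv itself via LOAD DATA LOCAL INFILE.
    # Assumes the server allows local_infile and the driver enables it
    # (for pymysql: connect_args={'local_infile': True}).
    engine = sqlalchemy.create_engine(uri, connect_args={'local_infile': True})
    con = engine.raw_connection()
    try:
        with con.cursor() as cursor:
            cursor.execute(
                "LOAD DATA LOCAL INFILE 'tmp.csv' "
                "INTO TABLE {} "
                "FIELDS TERMINATED BY ',' "
                "LINES TERMINATED BY '\\n' "
                "IGNORE 1 LINES "
                "(id, col1, col2, col3, col4)".format(table_name))
        con.commit()
    finally:
        con.close()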

Roger Thomas answered Oct 13 '22


Here are some stats to support the answer from @Mike Tung: executemany outperforms execute. With execute it was hard to reach even 315 inserts per second, while with executemany I achieved 25,000 inserts per second.

Base machine configuration -

2.7 GHz Dual-Core Intel Core i5
16 GB 1867 MHz DDR3
Flash Storage

Results:

cursor.execute: 250 to a maximum of 315 inserts per second
cursor.executemany: 25,000 inserts per second
user 923227 answered Oct 13 '22


Take the commit out of the for loop:

for row in csv_data:
     cursor.execute(query,row)
db.commit()

It will do less work and will be faster.

nacho answered Oct 13 '22