
Best practices of commit() when inserting many rows

[Situation-Aim]
We have a CSV data set of 1 million rows. These need to be imported into a MySQL 5.6 database, into an InnoDB table with the following columns (matching the CSV file):
Column1: bigint(15)
Column2: datetime
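
For reference, a minimal sketch of what such a table definition might look like (the table name mytable matches the script below; the column names and the absence of keys/constraints are assumptions, since they are not stated above):

CREATE TABLE mytable (
    Column1 BIGINT(15),
    Column2 DATETIME
) ENGINE=InnoDB;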

To import this data, we could use the following Python script:

#python version: 3.4.3
#mysql connector version: 2.0.4
#Note: this is just a simple example script!

import mysql.connector
import csv

#Set-up db connection
cnx = mysql.connector.connect(user='user', password='secret', database='mydatabase')
cursor = cnx.cursor()

#read/ open csv
reader = csv.reader(open("C:/test.csv", "rt"))

#ignore header
next(reader)

#read CSV and send to mysql
for row in reader:
    id = row[0]
    adate = row[1]
    #use a parameterized query so the connector handles quoting/escaping
    cursor.execute("INSERT INTO mytable VALUES (%s, %s)", (id, adate))

#commit the inserts and close the connection
cnx.commit()
cursor.close()
cnx.close()


[Questions]
What is conceptually happening when I put cnx.commit() inside the for loop, as follows:

for row in reader:
    id = row[0]
    adate = row[1]
    cursor.execute("INSERT INTO mytable VALUES (%s, %s)", (id, adate))
    cnx.commit()
  • If I understood it correctly, this is the difference between forcing MySQL to write after each iteration, as opposed to MySQL keeping everything in its cache and writing it all out when commit() is finally called.
  • What are the performance gains/drawbacks of doing this?
  • Are there any conventions/best practices you adhere to, and why?

Thanks!

asked Oct 31 '22 by Mark

1 Answer

You've shown two extreme positions.

One is to commit only after your megarow of insertions. The other is to commit after every row.

Both of these have poor performance. The first is bad because MySQL builds one huge transaction and then, at the end, has to commit the whole thing; that takes a lot of RAM or temp space. The second is bad because it pays the overhead of a full commit for every single row, so you end up with a million tiny transactions.

The best-performing way to do this kind of bulk insert is to commit every thousand rows or so. That's a little harder to program than the alternatives, but in my experience it is several times faster, so it's worth the small amount of extra complexity.
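
For example, here is a minimal sketch of that batching applied to the script above (the batch size of 1000 is just an assumed starting point to tune for your data and hardware):

#commit in batches instead of once at the end or once per row
batch_size = 1000   #assumed starting point; tune as needed
count = 0

for row in reader:
    cursor.execute("INSERT INTO mytable VALUES (%s, %s)", (row[0], row[1]))
    count += 1
    if count % batch_size == 0:
        cnx.commit()

#commit whatever remains from the last partial batch
cnx.commit()

Grouping the rows yourself and passing each batch to cursor.executemany() can cut down round trips even further, since the connector can send the batch as a multi-row INSERT.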

Another fast (but brittle) way to do this kind of bulk load is with LOAD DATA INFILE, about which you can read here: https://dev.mysql.com/doc/refman/5.6/en/load-data.html
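
If you go that route from Python, a rough sketch might look like the following (this assumes a connector version that supports the allow_local_infile connection option, that LOCAL INFILE is enabled on the server, and that the field/line terminators actually match your CSV):

import mysql.connector

#re-open the connection with LOCAL INFILE allowed on the client side
cnx = mysql.connector.connect(user='user', password='secret',
                              database='mydatabase', allow_local_infile=True)
cursor = cnx.cursor()

#let the server parse and load the whole file in one statement
cursor.execute("""
    LOAD DATA LOCAL INFILE 'C:/test.csv'
    INTO TABLE mytable
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
""")
cnx.commit()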

There's one little twist on your single megarow commit: MySQL connections can have autocommit mode enabled, in which case every INSERT is committed as soon as it executes and your first example effectively behaves like your second. By default autocommit is disabled in the Python connector.
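
As a quick sanity check on the connection from the script above, a small sketch of inspecting and setting that flag through the connector's autocommit property:

#check whether every execute() is being committed implicitly
print(cnx.autocommit)    #False by default in mysql-connector-python

#keep explicit control over transaction boundaries
cnx.autocommit = False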

answered Nov 09 '22 by O. Jones