Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Updating millions of rows in MySQL -- when to commit

Tags:

python

mysql

I have a for loop that goes through millions of objects. What would be the suggested way to commit this? Here are a few examples I was thinking of:

# after each
for item in items:
    cursor.execute()
    conn.commit()

# at the end
for item in items:
    cursor.execute()
conn.commit()

# after N items
for n, item in enumerate(items):
    cursor.execute()
    if n % N == 0:
        conn.commit()
conn.commit()

Which of the above would be the most efficient?

like image 997
David542 Avatar asked Sep 07 '14 18:09

David542


2 Answers

You're asking, should I commit ...

  1. just once at the end of the huge operation
  2. after each row update
  3. after every N items.

First of all, if you're working with MyISAM, don't worry about it. MyISAM doesn't implement transactions.

If your system won't be corrupted by committing only some of your row changes, you should commit after every N items. That's because committing after every item will slow down your processing a lot. Only committing once at the end of the huge operation is likely either to blow out your rollback space, or take an unspeakably long time during which your other data base users will have slowdowns.

I've been through this countless times. Any N greater than about 20 will do fine.

like image 129
O. Jones Avatar answered Oct 25 '22 02:10

O. Jones


Arguably the "at the end" one, since you BEGIN and COMMIT only once and it's one single transaction with its own scope. It's easier from concurrency point of view: basically the transaction says: this is my table now, don't touch it anyone.

If you commit multiple times (the other two solutions), you BEGIN and COMMIT a transaction many times (there's a next transaction started right after you commit). This means more chance of interruption from other ongoing DB operations. Also - those operations take time themselves.

You should run a benchmark simulating your use-case, though. I'd be curious to know if depending on some conditions (number of rows to insert, session configuration, data type, indexes used) one or the other solution may prevail.

like image 29
PawelP Avatar answered Oct 25 '22 02:10

PawelP