Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using the python MySQLDB SScursor with nested queries

The typical MySQLdb library query can use a lot of memory and perform poorly in Python, when a large result set is generated. For example:

cursor.execute("SELECT id, name FROM `table`")
for i in xrange(cursor.rowcount):
    id, name = cursor.fetchone()
    print id, name

There is an optional cursor that will fetch just one row at a time, really speeding up the script and cutting the memory footprint of the script a lot.

import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(user="user", passwd="password", db="dbname", 
                       cursorclass = MySQLdb.cursors.SSCursor)
cur = conn.cursor()
cur.execute("SELECT id, name FROM users")
row = cur.fetchone()
while row is not None:
    doSomething()
    row = cur.fetchone()    
cur.close()
conn.close()

But I can't find anything about using SSCursor with with nested queries. If this is the definition of doSomething():

def doSomething()
    cur2 = conn.cursor()
    cur2.execute('select id,x,y from table2')
    rows = cur2.fetchall()
    for row in rows:
        doSomethingElse(row)
    cur2.close()

then the script throws the following error:

_mysql_exceptions.ProgrammingError: (2014, "Commands out of sync; you can't run this command now")

It sounds as if SSCursor is not compatible with nested queries. Is that true? If so that's too bad because the main loop seems to run too slowly with the standard cursor.

like image 710
shigeta Avatar asked Sep 10 '14 17:09

shigeta


1 Answers

This problem in discussed a bit in the MySQLdb User's Guide, under the heading of the threadsafety attribute (emphasis mine):

The MySQL protocol can not handle multiple threads using the same connection at once. Some earlier versions of MySQLdb utilized locking to achieve a threadsafety of 2. While this is not terribly hard to accomplish using the standard Cursor class (which uses mysql_store_result()), it is complicated by SSCursor (which uses mysql_use_result(); with the latter you must ensure all the rows have been read before another query can be executed.

The documentation for the MySQL C API function mysql_use_result() gives more information about your error message:

When using mysql_use_result(), you must execute mysql_fetch_row() until a NULL value is returned, otherwise, the unfetched rows are returned as part of the result set for your next query. The C API gives the error Commands out of sync; you can't run this command now if you forget to do this!

In other words, you must completely fetch the result set from any unbuffered cursor (i.e., one that uses mysql_use_result() instead of mysql_store_result() - with MySQLdb, that means SSCursor and SSDictCursor) before you can execute another statement over the same connection.

In your situation, the most direct solution would be to open a second connection to use while iterating over the result set of the unbuffered query. (It wouldn't work to simply get a buffered cursor from the same connection; you'd still have to advance past the unbuffered result set before using the buffered cursor.)

If your workflow is something like "loop through a big result set, executing N little queries for each row," consider looking into MySQL's stored procedures as an alternative to nesting cursors from different connections. You can still use MySQLdb to call the procedure and get the results, though you'll definitely want to read the documentation of MySQLdb's callproc() method since it doesn't conform to Python's database API specs when retrieving procedure outputs.


A second alternative is to stick to buffered cursors, but split up your query into batches. That's what I ended up doing for a project last year where I needed to loop through a set of millions of rows, parse some of the data with an in-house module, and perform some INSERT and UPDATE queries after processing each row. The general idea looks something like this:

QUERY = r"SELECT id, name FROM `table` WHERE id BETWEEN %s and %s;"
BATCH_SIZE = 5000

i = 0
while True:
    cursor.execute(QUERY, (i + 1, i + BATCH_SIZE))
    result = cursor.fetchall()

    # If there's no possibility of a gap as large as BATCH_SIZE in your table ids,
    # you can test to break out of the loop like this (otherwise, adjust accordingly):
    if not result:
        break

    for row in result:
        doSomething()

    i += BATCH_SIZE

One other thing I would note about your example code is that you can iterate directly over a cursor in MySQLdb instead of calling fetchone() explicitly over xrange(cursor.rowcount). This is especially important when using an unbuffered cursor, because the rowcount attribute is undefined and will give a very unexpected result (see: Python MysqlDB using cursor.rowcount with SSDictCursor returning wrong count).

like image 95
Air Avatar answered Nov 14 '22 23:11

Air