Python MySQLdb SSCursor slow compared to exporting to and importing from a CSV file. Speedup possible?

As part of building a Data Warehouse, I have to query a source database table for about 75M rows.

What I want to do with the 75M rows is some processing, followed by adding the result to another database. Now, this is quite a lot of data, and I've had success with mainly two approaches:

1) Exporting the query result to a CSV file using the "SELECT ... INTO" capabilities of MySQL and reading it with Python's fileinput module, and

2) connecting to the MySQL database using MySQLdb's SSCursor (the default cursor buffers the whole result set in memory, which kills the Python script) and fetching the results in chunks of about 10k rows (the chunk size I've found to be the fastest).

The first approach is a SQL query executed "by hand" (it takes about 6 minutes), followed by a Python script that reads the CSV file and processes it. The reason I use fileinput to read the file is that fileinput doesn't load the whole file into memory up front, so it works well with larger files. Just traversing the file (reading every line in the file and calling pass) takes about 80 seconds, that is, 1M rows/s.
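
For completeness, the export step looks roughly like the sketch below (the connection details and output path are placeholders; in practice I run the statement by hand in the mysql client rather than from Python). The ';' delimiter is what the reading script splits on:

    import MySQLdb

    # Rough sketch of the export step. The server needs the FILE privilege and
    # write access to the target directory.
    export_stmt = """SELECT LT.SomeID, LT.weekID, W.monday, GREATEST(LT.attr1, LT.attr2)
                     FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
                     ORDER BY LT.someID ASC, LT.weekID ASC
                     INTO OUTFILE '/path/to/csv/dump/dump.csv'
                     FIELDS TERMINATED BY ';'
                     LINES TERMINATED BY '\\n'"""

    conn = MySQLdb.connect(host='localhost', user='dw_user', passwd='secret', db='source_db')
    cur = conn.cursor()   # a plain cursor is fine here: no rows are sent back to Python
    cur.execute(export_stmt)
    cur.close()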

The second approach is a Python script executing the same query (which also takes about 6 minutes, or slightly longer) and then a while-loop fetching chunks of rows for as long as there are any left in the SSCursor. Here, just reading the lines (fetching one chunk after another and not doing anything else) takes about 15 minutes, or approximately 85k rows/s.
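
For reference, the server-side cursor in the second approach is created along these lines (the connection details are placeholders):

    import MySQLdb
    import MySQLdb.cursors

    # Placeholder connection details.
    conn = MySQLdb.connect(host='localhost', user='dw_user', passwd='secret', db='source_db')

    # A default cursor would buffer the entire result set on the client;
    # an SSCursor keeps the result on the server and streams rows as they are fetched.
    ssc = conn.cursor(MySQLdb.cursors.SSCursor)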

The two numbers (rows/s) above are perhaps not really comparable, but when benchmarking the two approaches in my application, the first one takes about 20 minutes (of which about five minutes is MySQL dumping into a CSV file), and the second one takes about 35 minutes (of which about five minutes is the query being executed). This means that dumping to and reading from a CSV file is about twice as fast as using an SSCursor directly.

This would be no problem if it did not restrict the portability of my system: a "SELECT ... INTO" statement requires MySQL to have write privileges, and I suspect that it is not as safe as using cursors. On the other hand, 15 minutes (and growing, as the source database grows) is not really something I can spare on every build.

So, am I missing something? Is there any known reason for the SSCursor to be so much slower than dumping/reading to/from a CSV file, for example that fileinput is C-optimized where the SSCursor is not? Any ideas on how to proceed with this problem? Anything to test? I would believe that the SSCursor could be as fast as the first approach, but after reading all I can find about the matter, I'm stumped.

Now, to the code:

Not that I think the query itself is the problem (it's as fast as I can ask for, and takes similar time in both approaches), but here it is for the sake of completeness:

    SELECT LT.SomeID, LT.weekID, W.monday, GREATEST(LT.attr1, LT.attr2)
    FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
    ORDER BY LT.someID ASC, LT.weekID ASC;

The primary code in the first approach is something like this:

    import fileinput

    INPUT_PATH = 'path/to/csv/dump/dump.csv'
    event_list = []
    ID = -1

    for line in fileinput.input([INPUT_PATH]):
        # Strip the trailing newline so the last field doesn't keep a '\n'.
        split_line = line.rstrip('\n').split(';')
        if split_line[0] == ID:
            # Same ID as the previous row: keep accumulating its events.
            event_list.append(split_line[1:])
        else:
            # New ID: process everything collected for the previous one.
            process_function(ID, event_list)
            event_list = [split_line[1:]]
            ID = split_line[0]

    # Process the events of the last ID.
    process_function(ID, event_list)

The primary code in the second approach is:

    import MySQLdb

    # ... opening connection, defining the SSCursor called ssc (see the sketch above) ...
    CHUNK_SIZE = 100000

    query_stmt = """SELECT LT.SomeID, LT.weekID, W.monday,
                    GREATEST(LT.attr1, LT.attr2)
                    FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
                    ORDER BY LT.someID ASC, LT.weekID ASC"""
    ssc.execute(query_stmt)

    event_list = []
    ID = -1

    # Stream the result set in fixed-size chunks instead of loading it all at once.
    data_chunk = ssc.fetchmany(CHUNK_SIZE)
    while data_chunk:
        for row in data_chunk:
            if row[0] == ID:
                # Same ID as the previous row: keep accumulating its events.
                event_list.append([row[1], row[2], row[3]])
            else:
                # New ID: process everything collected for the previous one.
                process_function(ID, event_list)
                event_list = [[row[1], row[2], row[3]]]
                ID = row[0]
        data_chunk = ssc.fetchmany(CHUNK_SIZE)

    # Process the events of the last ID.
    process_function(ID, event_list)

Lastly, I'm on Ubuntu 13.04 with MySQL Server 5.5.31. I use Python 2.7.4 with MySQLdb 1.2.3. Thank you for staying with me this long!

Asked Jul 09 '13 by Lyckberg

1 Answer

After using cProfile, I found that a lot of time was being spent implicitly constructing Decimal objects, since Decimal was the numeric type returned from the SQL query to my Python script. In the first approach, the Decimal value was written to the CSV file as an integer and then read back as such by the Python script. The CSV file I/O "flattened" the data, making the script faster. The two scripts are now about the same speed (the second approach is still just a tad slower).
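
For reference, the profiling was a plain cProfile run over the script; a minimal sketch (main() is a placeholder for whatever entry point drives the fetch/process loop):

    import cProfile
    import pstats

    # Profile the full run and dump the stats to a file.
    cProfile.run('main()', 'profile.out')

    # Print the 20 functions with the largest cumulative time; Decimal's
    # constructor showing up near the top is what revealed the conversion cost.
    stats = pstats.Stats('profile.out')
    stats.sort_stats('cumulative').print_stats(20)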

I also converted the date to an integer type on the MySQL side. My query is now:

    SELECT LT.SomeID,
           LT.weekID,
           CAST(DATE_FORMAT(W.monday,'%Y%m%d') AS UNSIGNED),
           CAST(GREATEST(LT.attr1, LT.attr2) AS UNSIGNED)
    FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
    ORDER BY LT.someID ASC, LT.weekID ASC;

This almost eliminates the difference in processing time between the two approaches.
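
A related option I did not use here, for what it's worth: MySQLdb also lets you override its result-type converters so DECIMAL columns come back as floats instead of Decimal objects, without touching the query. A rough sketch (the connection details are placeholders):

    import MySQLdb
    import MySQLdb.converters
    from MySQLdb.constants import FIELD_TYPE

    # Copy the default converter table and map DECIMAL/NEWDECIMAL to float,
    # so the driver never builds a decimal.Decimal object for each value.
    conv = MySQLdb.converters.conversions.copy()
    conv[FIELD_TYPE.DECIMAL] = float
    conv[FIELD_TYPE.NEWDECIMAL] = float

    conn = MySQLdb.connect(host='localhost', user='dw_user', passwd='secret',
                           db='source_db', conv=conv)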

The lesson here is that when doing large queries, post-processing of data types DOES matter! Rewriting the query to reduce the number of function calls on the Python side can improve the overall processing speed significantly.

Answered by Air