pymssql/pyodbc performance (cursor.execute) is very slow when querying large SQL Server table

We have a large view on SQL Server (with ca. 500M records). Since it won't fit into memory, I was thinking of processing it in chunks using fetchmany, like below:

import pymssql
from concurrent import futures

with pymssql.connect(host, user, password, db) as conn:

    query = "SELECT * FROM view_name;"

    with conn.cursor() as cursor, futures.ThreadPoolExecutor(3) as executor:
        cursor.execute(query)
        chunk_size = 5000
        data = cursor.fetchmany(chunk_size)

        while data:
            future_rs = executor.submit(process_chunk, data)
            data = cursor.fetchmany(chunk_size)

However, it looks like cursor.execute actually tries to fetch all rows before fetchmany can be called, as it is extremely slow.
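To confirm where the time goes, a quick timing check can be dropped in (my own diagnostic, reusing the cursor and query from the snippet above):

import time

start = time.perf_counter()
cursor.execute(query)              # this call alone takes a very long time
print(f"execute: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
data = cursor.fetchmany(5000)      # returns almost instantly afterwards
print(f"fetchmany: {time.perf_counter() - start:.1f}s")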

My understanding from the docs is that cursor.execute should only prepare the query and not materialize the full result set?

How would you process such a large table/view in a manageable time?

PS: I also tried pyodbc; it's the same issue. Changing the query to select top 100 * from view_name is fast, as expected.

asked Sep 04 '19 by chhantyal

1 Answer

OK, after quite a long debugging session, I have a solution.

Part of the issue turned out to be the underlying view, which is very slow. I misjudged this because a database client like DBeaver returns results pretty fast (likely because it applies pagination to the query behind the scenes?). Anyway, what I was trying to do with cursor.fetchmany, I did with a database feature instead.

SQL Server 2012 and later has a very nice pagination feature using OFFSET and FETCH NEXT. So my solution looks something like this:

import pymssql
from concurrent import futures

offset = 0
offset_increment = 200000

def get_chunk(cursor, offset):
    # ORDER BY is required for OFFSET ... FETCH NEXT to work
    query = f"""
            SELECT * FROM table ORDER BY some_col
            OFFSET {offset} ROWS FETCH NEXT {offset_increment} ROWS ONLY;
            """
    cursor.execute(query)
    return cursor.fetchall()

# connection as in the question
with pymssql.connect(host, user, password, db) as conn, \
        conn.cursor() as cursor, \
        futures.ThreadPoolExecutor(6) as executor:
    chunk = get_chunk(cursor, offset)

    while chunk:
        executor.submit(process_chunk, chunk)
        offset += offset_increment
        chunk = get_chunk(cursor, offset)

So the implementation here is to:

  • Use SQL Server's pagination feature with OFFSET and FETCH NEXT to fetch only a limited number of rows at a time.
  • Process chunks in parallel using multiple threads. You could also parallelize the SQL query execution itself to make it even faster; that needs a bit more work, since you have to know when to stop (see the sketch after this list).
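
For that second point, here is a rough sketch of what parallelizing the fetches themselves could look like. This is my illustration rather than the code I actually ran: fetch_and_process, chunk_rows and the worker count are made up, and each worker opens its own connection because DB-API connections generally shouldn't be shared across threads.

from concurrent import futures
import pymssql

chunk_rows = 200000
workers = 6

def fetch_and_process(offset):
    # each worker gets its own connection and cursor
    with pymssql.connect(host, user, password, db) as conn:
        with conn.cursor() as cursor:
            cursor.execute(f"""
                SELECT * FROM table ORDER BY some_col
                OFFSET {offset} ROWS FETCH NEXT {chunk_rows} ROWS ONLY;
            """)
            rows = cursor.fetchall()
    if rows:
        process_chunk(rows)
    return len(rows)  # 0 means this page is past the end of the table

with futures.ThreadPoolExecutor(workers) as executor:
    offset = 0
    done = False
    while not done:
        # dispatch one page per worker, wait, then check whether any page was empty
        batch = [executor.submit(fetch_and_process, offset + i * chunk_rows)
                 for i in range(workers)]
        done = any(f.result() == 0 for f in batch)
        offset += workers * chunk_rows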

This is the basic idea behind my solution. The code above is just an example; in reality I had to do a lot more tuning in my project based on resource usage (mainly memory). You can also use ProcessPoolExecutor to do multiprocessing instead of threading. The idea is the same; the code needs a few changes, since multiprocessing can only pass picklable objects.
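A ProcessPoolExecutor variant could look something like the minimal sketch below, reusing get_chunk and the cursor from the snippet above. The worker count and the __main__ guard are my additions; the connection and cursor stay in the parent process because they can't be pickled, while the fetched rows (plain tuples) can.

from concurrent import futures

if __name__ == "__main__":
    offset = 0
    with futures.ProcessPoolExecutor(4) as executor:
        # process_chunk must be a module-level function so it can be pickled
        chunk = get_chunk(cursor, offset)
        while chunk:
            executor.submit(process_chunk, chunk)
            offset += offset_increment
            chunk = get_chunk(cursor, offset)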

So by using pagination and processing results concurrently in chunks, you can work with large tables/views pretty easily :)

answered Oct 13 '22 by chhantyal