We have a large view on SQL Server (ca. 500M records). Since it won't fit into memory, I was thinking of processing it in chunks using fetchmany, like below:
import pymssql
from concurrent import futures

with pymssql.connect(host, user, password, db) as conn:
    query = "SELECT * FROM view_name;"
    with conn.cursor() as cursor, futures.ThreadPoolExecutor(3) as executor:
        cursor.execute(query)
        chunk_size = 5000
        data = cursor.fetchmany(chunk_size)
        while data:
            future_rs = executor.submit(process_chunk, data)
            data = cursor.fetchmany(chunk_size)
However, it looks like cursor.execute actually tries to fetch all rows before I can call fetchmany, because it is extremely slow. My understanding from the docs is that cursor.execute should only prepare the query, not materialize the full result set?

How would you process such a large table/view within a manageable time?

PS: I also tried pyodbc; it's the same issue. Changing the query to SELECT TOP 100 * FROM view_name is fast, as expected.
OK, after quite a debugging session, I have a solution.

Part of the issue turned out to be the underlying view, which is very slow. I misjudged this because a database client like DBeaver returns results pretty fast (likely because it applies pagination to the query behind the scenes?). Anyway, what I was trying to do with cursor.fetchmany, I ended up doing with a database feature instead.

SQL Server 2012 and later has a very nice pagination feature using OFFSET and FETCH NEXT. So my solution looks something like this:
import pymssql
from concurrent import futures

offset = 0
offset_increment = 200000

def get_chunk(cursor, offset):
    # Fetch one page of rows using OFFSET / FETCH NEXT pagination.
    query = f"""
        SELECT * FROM table_name ORDER BY some_col
        OFFSET {offset} ROWS FETCH NEXT {offset_increment} ROWS ONLY;
    """
    cursor.execute(query)
    return cursor.fetchall()

with pymssql.connect(host, user, password, db) as conn:
    with conn.cursor() as cursor, futures.ThreadPoolExecutor(6) as executor:
        chunk = get_chunk(cursor, offset)
        while chunk:
            # Hand the current chunk to a worker thread, then fetch the next page.
            executor.submit(process_chunk, chunk)
            offset += offset_increment
            chunk = get_chunk(cursor, offset)
So the implementation here is to:

- Use OFFSET and FETCH NEXT to fetch only a limited number of rows per query.
- Process each chunk concurrently while the next one is being fetched.

This is the basic idea behind my solution. The code above is just an example; in reality I had to do a lot more tuning in my project based on resource usage (mainly memory).
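For example, one way to keep memory in check (a minimal sketch, not necessarily the exact tuning I did) is to cap how many fetched-but-unprocessed chunks can be in flight at once, using a semaphore around executor.submit. The submit_bounded helper and MAX_IN_FLIGHT below are just illustrative names:

import threading
from concurrent import futures

MAX_IN_FLIGHT = 6  # at most this many chunks held in memory at once

in_flight = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def submit_bounded(executor, fn, chunk):
    # Block until a slot is free, so unprocessed chunks cannot pile up in memory.
    in_flight.acquire()
    future = executor.submit(fn, chunk)
    # Free the slot once the chunk has been processed (or its task failed).
    future.add_done_callback(lambda _f: in_flight.release())
    return future

In the loop above you would then call submit_bounded(executor, process_chunk, chunk) instead of executor.submit(process_chunk, chunk).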
You can also use ProcessPoolExecutor to do multiprocessing instead of threading. The idea is the same; the code needs a few changes, since multiprocessing only works with picklable objects.
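A rough sketch of what that could look like, reusing get_chunk from the example above and assuming process_chunk is a module-level function (so worker processes can import it); the rows returned by pymssql are plain tuples, which pickle fine. run_with_processes is just an illustrative name:

from concurrent import futures

def run_with_processes(cursor, offset_increment=200000):
    # Same pagination loop as above, but chunks are processed in worker processes.
    offset = 0
    with futures.ProcessPoolExecutor(max_workers=6) as executor:
        chunk = get_chunk(cursor, offset)
        while chunk:
            executor.submit(process_chunk, chunk)  # chunk (a list of tuples) is pickled to the worker
            offset += offset_increment
            chunk = get_chunk(cursor, offset)

if __name__ == "__main__":
    # The __main__ guard is needed on platforms that spawn worker processes (e.g. Windows),
    # since the module gets re-imported in each worker.
    ...  # open the connection/cursor as in the example above, then call run_with_processes(cursor)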
So using pagination and processing results concurrently in chunks, you can work with large tables/views pretty easily :)