Python non-blocking write to CSV file

I am writing some Python code that does some calculations and writes the results to a file. Here is my current code:

for name, group in data.groupby('Date'):
    df = lot_of_numpy_calculations(group)

    with open('result.csv', 'a') as f:
        df.to_csv(f, header=False, index=False)

Both the calculation and the write take some time. I have read some articles about async in Python, but I don't know how to implement it. Is there an easy way to optimize this loop so that it doesn't wait for the write to finish before starting the next iteration?

asked Apr 27 '18 at 16:04 by JOHN


2 Answers

Since neither numpy nor pandas I/O is asyncio-aware, this might be a better use case for threads than for asyncio. (Also, asyncio-based solutions will use threads behind the scenes anyway.)

For example, this code spawns a writer thread to which you submit work using a queue:

import threading, queue

to_write = queue.Queue()

def writer():
    # Call to_write.get() until it returns None
    for df in iter(to_write.get, None):
        with open('result.csv', 'a') as f:
            df.to_csv(f, header=False, index=False)

writer_thread = threading.Thread(target=writer)
writer_thread.start()

for name, group in data.groupby('Date'):
    df = lot_of_numpy_calculations(group)
    to_write.put(df)
# enqueue None to instruct the writer thread to exit
to_write.put(None)

# wait for the writer to finish flushing any queued data frames
writer_thread.join()

Note that, if writing turns out to be consistently slower than the calculation, the queue will keep accumulating data frames, which might end up consuming a lot of memory. In that case be sure to provide a maximum size for the queue by passing the maxsize argument to the constructor.
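
For example, a bounded queue (the limit of 16 below is an arbitrary choice) makes to_write.put() block whenever the writer falls too far behind, which caps memory usage:

import queue

# hold at most 16 pending data frames; put() blocks while the queue is
# full, pausing the calculation loop until the writer catches up
to_write = queue.Queue(maxsize=16)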

Also, consider that re-opening the file for each write can slow down writing. If the amount of data written is small, perhaps you could get better performance by opening the file beforehand.
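
A minimal sketch of that variant, keeping the same queue protocol but opening result.csv only once:

def writer():
    # open the file once and reuse the handle for every data frame
    with open('result.csv', 'a') as f:
        for df in iter(to_write.get, None):
            df.to_csv(f, header=False, index=False)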

answered Oct 07 '22 at 19:10 by user4815162342


Since most operating systems don't support asynchronous file I/O, the common cross-platform approach today is to use threads.

For example, the aiofiles module wraps a thread pool to provide a file I/O API for asyncio.
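
A rough sketch of what that looks like, reusing the names from the question and assuming aiofiles is installed (df.to_csv() called without a target returns the CSV data as a string):

import asyncio
import aiofiles

async def main():
    async with aiofiles.open('result.csv', mode='a') as f:
        for name, group in data.groupby('Date'):
            df = lot_of_numpy_calculations(group)
            # the actual write happens in aiofiles' thread pool; the event
            # loop is free to run other tasks while this coroutine awaits
            await f.write(df.to_csv(header=False, index=False))

asyncio.run(main())

Note that this alone only makes the write awaitable; to overlap it with the numpy work you would still need to schedule the writes as separate tasks, or keep the calculation in another thread or process.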

answered Oct 07 '22 at 20:10 by Mikhail Gerasimov