I am writing some Python code that does some calculations and writes the results to a file. Here is my current code:
for name, group in data.groupby('Date'):
    df = lot_of_numpy_calculations(group)
    with open('result.csv', 'a') as f:
        df.to_csv(f, header=False, index=False)
Both the calculation and the write take some time. I read some articles about async in Python, but I didn't know how to implement it. Is there an easy way to optimize this loop so that it doesn't wait for the write to finish before starting the next iteration?
Since neither numpy nor pandas I/O are asyncio-aware, this might be a better use case for threads than for asyncio. (Also, asyncio-based solutions will use threads behind the scenes anyway.)
For example, this code spawns a writer thread to which you submit work using a queue:
import threading, queue

to_write = queue.Queue()

def writer():
    # call to_write.get() until it returns None
    for df in iter(to_write.get, None):
        with open('result.csv', 'a') as f:
            df.to_csv(f, header=False, index=False)

threading.Thread(target=writer).start()

for name, group in data.groupby('Date'):
    df = lot_of_numpy_calculations(group)
    to_write.put(df)

# enqueue None to instruct the writer thread to exit
to_write.put(None)
Note that, if writing turns out to be consistently slower than the calculation, the queue will keep accumulating data frames, which might end up consuming a lot of memory. In that case, be sure to provide a maximum size for the queue by passing the maxsize argument to the constructor.
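For instance (the limit of 16 here is an arbitrary choice, not a recommendation):

to_write = queue.Queue(maxsize=16)  # put() now blocks once 16 frames are pending

With a bounded queue, the producer loop automatically slows down to match the writer instead of growing the backlog without limit.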
Also, consider that re-opening the file for each write can slow down writing. If the amount of data written is small, perhaps you could get better performance by opening the file beforehand.
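A minimal variant of the writer above that does this, keeping the file open for the lifetime of the thread:

def writer():
    # open once; the handle is closed when the None sentinel arrives
    with open('result.csv', 'a') as f:
        for df in iter(to_write.get, None):
            df.to_csv(f, header=False, index=False)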
Since most operating systems don't support asynchronous file I/O, the common cross-platform approach today is to use threads. For example, the aiofiles module wraps a thread pool to provide a file I/O API for asyncio.
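A minimal sketch of that approach, reusing the data and lot_of_numpy_calculations names from the question (aiofiles is a third-party package and must be installed separately; asyncio.to_thread requires Python 3.9+):

import asyncio
import aiofiles

async def main():
    async with aiofiles.open('result.csv', 'a') as f:
        for name, group in data.groupby('Date'):
            # run the CPU-bound calculation in a worker thread
            df = await asyncio.to_thread(lot_of_numpy_calculations, group)
            # render the frame to a string; the write itself is handed
            # off to aiofiles' internal thread pool
            await f.write(df.to_csv(header=False, index=False))

asyncio.run(main())

Note that each write is still awaited before the next group is processed, so this mainly demonstrates the API: the event loop stays free for other tasks while the write runs, but to overlap calculation and writing within this loop you would still want something like the queue-based solution above.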