I am making a program that reads multiple files and writes a summary of each file to an output file. The output file is rather big, so keeping it in memory is not a good idea. I am trying to develop a multiprocessing way of doing it. So far, the simplest way I was able to come up with is:
pool = Pool(processes=4)
it = pool.imap_unordered(do, glob.iglob(aglob))
for summary in it:
    writer.writerows(summary)
do is the function that summarizes the file, and writer is a csv.writer object.
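For context, here is a self-contained version of the setup (the body of do, the glob pattern and the column layout are hypothetical stand-ins, since the real summarizing logic is not shown here):

import csv
import glob
from multiprocessing import Pool

def do(path):
    # Hypothetical summary: one row per file with its path and line count.
    with open(path) as f:
        return [(path, sum(1 for _ in f))]

if __name__ == "__main__":
    aglob = "data/*.txt"                      # hypothetical pattern
    with open("summary.csv", "w", newline="") as out:
        writer = csv.writer(out)
        with Pool(processes=4) as pool:
            for summary in pool.imap_unordered(do, glob.iglob(aglob)):
                writer.writerows(summary)     # written as results arrive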
But the truth is that I still do not completely understand multiprocessing.imap_unordered. Does this mean that 4 summaries are calculated in parallel and that, when I read one of them, the 5th starts to be calculated?
Is there a better way of doing this?
Thanks.
processes=4 means that multiprocessing will start a pool with four worker processes and send the work items to them. Ideally, if your system supports it, i.e. either you have four cores or the workers are not totally CPU-bound, 4 work items will be processed in parallel.
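As a side note, a minimal sketch of sizing the pool to the machine instead of hard-coding 4 (if you leave processes out, Pool already defaults to os.cpu_count()):

import os
from multiprocessing import Pool

# Make the worker count explicit instead of hard-coding 4.
pool = Pool(processes=os.cpu_count())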
I don't know the implementation of multiprocessing, but I think that the results of do will be cached internally even before you read them out, i.e. the 5th item will be computed once any process is done with an item from the first wave.
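A small experiment, not from the original code, that makes this visible: each work item records the time at which the worker finished it, while the main loop deliberately reads slowly. On a typical 4-core machine the finish times all cluster near the start, while the read times grow by one second per item, i.e. the workers did not wait for the reader.

import time
from multiprocessing import Pool

def work(i):
    # Stand-in for summarizing a file: return the item id and the
    # wall-clock time at which the worker finished it.
    return i, time.time()

if __name__ == "__main__":
    start = time.time()
    with Pool(processes=4) as pool:
        for i, finished in pool.imap_unordered(work, range(8)):
            time.sleep(1)  # deliberately slow consumer
            print("item %d finished at %.2fs, read at %.2fs"
                  % (i, finished - start, time.time() - start))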
Whether there is a better way depends on the type of your data: how many files there are in total that need processing, how large the summary objects are, and so on. If you have many files (say, more than 10k), batching them might be an option, via
it = pool.imap_unordered(do, glob.iglob(aglob), chunksize=100)
This way, a work item is not one file, but 100 files, and results are also reported in batches of 100. If you have many work items, chunking lowers the overhead of pickling and unpickling the result objects.
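One detail worth noting: at least in CPython, the iterator returned with a chunksize still yields one result per input item, so your for summary in it loop does not change; the batching only affects how tasks and results travel between the processes. A quick self-contained check with a toy function:

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = list(pool.imap_unordered(square, range(10), chunksize=3))
    print(sorted(results))  # one int per input: [0, 1, 4, 9, ...]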