 

multiprocessing imap_unordered in python

I am writing a program that reads multiple files and writes a summary of each file to an output file. The output file is rather large, so keeping it in memory is not a good idea. I am trying to develop a multiprocessing way of doing it. So far, the simplest approach I have come up with is:

pool = Pool(processes=4)
it = pool.imap_unordered(do, glob.iglob(aglob))
for summary in it:
    writer.writerows(summary)

do is the function that summarizes a file; writer is a csv.writer object.
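For concreteness, here is roughly what that setup looks like (the summary logic below is a made-up placeholder; the real do does more work, it just has to return a sequence of rows that writer.writerows can consume):

import csv
import glob
from multiprocessing import Pool

def do(path):
    # Placeholder summary: one row per file with its line count.
    # The real function computes whatever summary is needed.
    with open(path) as f:
        return [[path, sum(1 for _ in f)]]

outfile = open("summary.csv", "w", newline="")
writer = csv.writer(outfile)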

But the truth is that I still do not understand multiprocessing.imap completely. Does this mean that 4 summaries are calculated in parallel, and that when I read one of them, the 5th starts to be calculated?

Is there a better way of doing this?

Thanks.

asked Jun 10 '11 by Hernan
1 Answer

processes=4 means that multiprocessing will start a pool with four worker processes and send the work items to them. Ideally, if your system supports it, i.e. you either have four cores or the workers are not totally CPU-bound, 4 work items will be processed in parallel.

I don't know the internals of multiprocessing, but I think the results of do are buffered internally even before you read them out, i.e. the 5th item will start being computed as soon as any worker finishes an item from the first wave.
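A quick way to observe this is a toy script like the following (everything here is illustrative and not from your program; do just sleeps instead of summarizing a file). Results come back in completion order rather than submission order, and the pool hands new items to workers as soon as they free up:

import random
import time
from multiprocessing import Pool

def do(item):
    # Stand-in for the real per-file work.
    time.sleep(random.uniform(0.1, 0.5))
    return item

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        for result in pool.imap_unordered(do, range(8)):
            print(time.strftime("%H:%M:%S"), "got item", result)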

Whether there is a better way depends on the type of your data: how many files need processing in total, how large the summary objects are, etc. If you have many files (say, more than 10k), batching them might be an option, via

it = pool.imap_unordered(do, glob.iglob(aglob), chunksize=100)

This way, a work item is not one file, but 100 files, and results are also reported in batches of 100. If you have many work items, chunking lowers the overhead of pickling and unpickling the result objects.
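Put together, a batched version of your loop could look like this (the file pattern, the output name and the body of do are assumptions made only to keep the sketch runnable):

import csv
import glob
from multiprocessing import Pool

def do(path):
    # Hypothetical summary: one row per file with its line count.
    with open(path) as f:
        return [[path, sum(1 for _ in f)]]

if __name__ == "__main__":
    with open("summaries.csv", "w", newline="") as out:
        writer = csv.writer(out)
        with Pool(processes=4) as pool:
            it = pool.imap_unordered(do, glob.iglob("data/*"), chunksize=100)
            for summary in it:
                writer.writerows(summary)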

answered Sep 30 '22 by Torsten Marek