 

multiprocessing pool.map call functions in certain order


How can I make multiprocessing.Pool.map hand out tasks in numerical order?


More Info:
I have a program which processes a few thousand data files, making a plot of each one. I'm using multiprocessing.Pool.map to distribute the files among processors, and it works great. Sometimes this takes a long time, and it would be nice to look at the output images as the program runs. That would be a lot easier if map handed out the snapshots in order; instead, for the particular run I just executed, the first 8 snapshots analyzed were: 0, 78, 156, 234, 312, 390, 468, 546. Is there a way to make it distribute them closer to numerical order?


Example:
Here's a sample code which contains the same key elements and shows the same basic result:

import sys
import time
from multiprocessing import Pool

num_proc = 4; num_calls = 20; sleeper = 0.1

def SomeFunc(arg):
    time.sleep(sleeper)
    print("%5d" % arg, end=" ")
    sys.stdout.flush()     # otherwise doesn't print properly on a single line

if __name__ == "__main__":    # guard needed on platforms that spawn rather than fork
    proc_pool = Pool(num_proc)
    proc_pool.map(SomeFunc, range(num_calls))

Yields:

   0  4  2  6   1   5   3   7   8  10  12  14  13  11   9  15  16  18  17  19

Answer:

From @Hayden: use the chunksize parameter of map, whose signature is def map(self, func, iterable, chunksize=None).

More Info:
The chunksize determines how many iterations are allocated to each processor at a time. My example above, for instance, uses a chunksize of 2, which means that each processor goes off and runs the function for 2 iterations, then comes back for more (a 'check-in').

The trade-off behind chunksize is that the check-in has overhead, since the processor has to sync up with the others, which suggests you want a large chunksize. On the other hand, with large chunks one processor might finish its chunk while another still has a long time left to go, which suggests a small chunksize. The additional useful information is how much the duration of each function call can vary. If they really should all take the same amount of time, it's much more efficient to use a large chunksize. If some function calls could take twice as long as others, you want a small chunksize so that processors aren't caught waiting.

For my problem, every function call should take very close to the same amount of time (I think), so if I want the processes to be called in order, I'm going to sacrifice efficiency because of the check-in overhead.

DilithiumMatrix asked Jul 27 '13 23:07


People also ask

Does Python multiprocessing preserve order?

Since the built-in map is guaranteed to preserve order, multiprocessing.Pool.map makes that guarantee too: results are returned in input order, even though the calls may execute out of order.

How do you pass multiple arguments in multiprocessing in Python?

We can also pass arguments by parameter name using the kwargs parameter of the Process class. Instead of a tuple, we pass a dictionary to kwargs mapping each argument name to the value being passed in as that argument.
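A minimal sketch of that kwargs mechanism (the greet function and its parameter names here are made up for illustration; args and kwargs are the real Process parameters):

```python
from multiprocessing import Process

def greet(name, punctuation="!"):
    # return the message as well, so the behaviour is easy to check;
    # the child process just prints it
    message = "Hello, " + name + punctuation
    print(message)
    return message

if __name__ == "__main__":
    # args supplies positional arguments; kwargs maps parameter names to values
    p = Process(target=greet, args=("world",), kwargs={"punctuation": "?"})
    p.start()
    p.join()
```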

How does pool map work Python?

The pool's map method chops the given iterable into a number of chunks, which it submits to the process pool as separate tasks. Pool.map is a parallel equivalent of the built-in map; it blocks the main process until all computations finish. The Pool constructor takes the number of worker processes as a parameter.
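A minimal sketch of that behaviour (cube is just a stand-in for any picklable worker function):

```python
from multiprocessing import Pool

def cube(x):
    return x ** 3

if __name__ == "__main__":
    # map blocks until every task has finished, then returns the results
    # in input order, regardless of which worker ran which chunk
    with Pool(processes=4) as pool:
        results = pool.map(cube, range(8))
    print(results)
```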


1 Answer

The reason this occurs is that each process is given a predefined amount of work to do at the start of the call to map, which is dependent on the chunksize. We can work out the default chunksize by looking at the source for Pool.map:

chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
if extra:
    chunksize += 1

So for a range of 20, and with 4 processes, we will get a chunksize of 2.
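That arithmetic can be checked directly; this helper simply mirrors the two lines quoted above (the name default_chunksize is mine, not CPython's):

```python
def default_chunksize(n_items, n_workers):
    # same computation as the Pool.map excerpt above:
    # aim for roughly 4 chunks per worker, rounding up
    chunksize, extra = divmod(n_items, n_workers * 4)
    if extra:
        chunksize += 1
    return chunksize

print(default_chunksize(20, 4))   # divmod(20, 16) -> (1, 4), so chunksize = 2
```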

If we modify your code to pass this explicitly, we should get similar results to what you are seeing now:

proc_pool.map(SomeFunc, range(num_calls), chunksize=2)

This yields the output:

0 2 6 4 1 7 5 3 8 10 12 14 9 13 15 11 16 18 17 19

Now, setting the chunksize=1 will ensure that each process within the pool will only be given one task at a time.

proc_pool.map(SomeFunc, range(num_calls), chunksize=1)

This should give a reasonably good numerical ordering compared to not specifying a chunksize. For example, a chunksize of 1 yields the output:

0 1 2 3 4 5 6 7 9 10 8 11 13 12 15 14 16 17 19 18
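For the original goal of watching output images appear as the run progresses, Pool.imap (a standard Pool method, though not discussed above) may also be worth noting: with chunksize=1 it hands out one task at a time and yields each result in input order as soon as it is ready, instead of blocking until the whole batch finishes. A sketch, with process_snapshot standing in for the real per-file plotting work:

```python
import time
from multiprocessing import Pool

def process_snapshot(i):
    time.sleep(0.01)   # stand-in for the real plotting work on file i
    return i

if __name__ == "__main__":
    finished = []
    with Pool(4) as pool:
        # imap yields results in input order as they complete, so snapshot 0
        # can be inspected long before the last one has been processed
        for result in pool.imap(process_snapshot, range(20), chunksize=1):
            finished.append(result)
    print(finished)
```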

Hayden answered Oct 24 '22 22:10