I am trying to pass keyword arguments to the map function of Python's multiprocessing.Pool instance.
Extrapolating from Using map() function with keyword arguments, I know I can use functools.partial(), as in the following:
from multiprocessing import Pool
from functools import partial
import sys

# Function to multiprocess
def func(a, b, c, d):
    print(a * (b + 2 * c - d))
    sys.stdout.flush()

if __name__ == '__main__':
    p = Pool(2)
    # Now, I try to call func(a, b, c, d) for 10 different a values,
    # but the same b, c, d values passed in as keyword arguments
    a_iter = range(10)
    kwargs = {'b': 1, 'c': 2, 'd': 3}
    mapfunc = partial(func, **kwargs)
    p.map(mapfunc, a_iter)
The output is correct:
0
2
4
6
8
10
12
14
16
18
Is this the best practice (most "pythonic" way) to do so? I felt that:
1) Pool is commonly used;
2) keyword arguments are commonly used;
3) but combining them as in my example above feels a bit "hacky".
Using partial may be suboptimal if the bound arguments are large. The function passed to map is repeatedly pickled when sent to the workers (once for every argument in the iterable); a global Python function is (essentially) pickled by sending its qualified name (the same function is defined on the other side, so no data needs to be transferred), while a partial is pickled as the pickle of the underlying function plus all of the provided arguments.
If kwargs is all small primitives, as in your example, this doesn't really matter; the incremental cost of sending along the extra arguments is trivial. But if kwargs is big, say, kwargs = {'b': [1] * 10000, 'c': [2] * 20000, 'd': [3] * 30000}, that's a nasty price to pay.
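A quick way to see this cost is to measure how large the pickled partial actually is. The following is just an illustrative sketch (standard library only; it re-defines func with a return instead of a print so it can stand alone):

import pickle
from functools import partial

def func(a, b, c, d):
    return a * (b + 2 * c - d)

small = partial(func, b=1, c=2, d=3)
big = partial(func, b=[1] * 10000, c=[2] * 20000, d=[3] * 30000)

# The pickle of a partial carries the bound arguments along with a reference
# to the function, so the "big" version is orders of magnitude larger,
# and that payload is re-sent to the workers for the pool's tasks.
print(len(pickle.dumps(small)))
print(len(pickle.dumps(big)))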
In that case, you have some options:
Roll your own function at the global level that works like partial but pickles differently:

def func_a_only(a):
    return func(a, 1, 2, 3)
Using the initializer argument to Pool, so each worker process sets up state once instead of once per task; this also lets you ensure the data is available even in a spawn-based environment (e.g. Windows). See the sketch below this list.
Using Managers to share a single copy of the data among all processes
And probably a handful of other approaches. Point is, partial is fine for arguments that don't produce huge pickles, but it can kill you if the bound arguments are huge.
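For the initializer option, a minimal sketch might look like the following (the worker-side globals _b, _c, _d and the init_worker helper are names I've made up for illustration, not anything required by the API):

from multiprocessing import Pool

_b = _c = _d = None  # worker-side globals, populated once per worker

def init_worker(b, c, d):
    # Runs once in each worker process, so (possibly large) values are
    # sent once per worker instead of once per task.
    global _b, _c, _d
    _b, _c, _d = b, c, d

def func_a_only(a):
    return a * (_b + 2 * _c - _d)

if __name__ == '__main__':
    with Pool(2, initializer=init_worker, initargs=(1, 2, 3)) as p:
        print(p.map(func_a_only, range(10)))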
Note: In this particular case, if you're on Python 3.3+, you don't actually need partial, and avoiding the dict in favor of tuples saves a trivial amount of overhead. Without adding any new functions, just some imports, you could replace:
kwargs = {'b': 1, 'c': 2, 'd': 3}
mapfunc = partial(func, **kwargs)
p.map(mapfunc, a_iter)
with:
from itertools import repeat
p.starmap(func, zip(a_iter, repeat(1), repeat(2), repeat(3)))
to achieve a similar effect. To be clear, there is nothing wrong with partial that this "fixes" (both approaches would have the same problem with pickling large objects); this is just an alternate approach that is occasionally useful.
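In case the zip/repeat combination looks opaque: it just builds one positional-argument tuple per call, which starmap then unpacks. A tiny illustration (not part of the original code):

from itertools import repeat

args = list(zip(range(3), repeat(1), repeat(2), repeat(3)))
print(args)  # [(0, 1, 2, 3), (1, 1, 2, 3), (2, 1, 2, 3)]
# Pool.starmap then calls func(0, 1, 2, 3), func(1, 1, 2, 3), ...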