I am trying to pass keyword arguments to the map function of Python's multiprocessing.Pool instance.
Extrapolating from Using map() function with keyword arguments, I know I can use functools.partial(), as in the following:
from multiprocessing import Pool
from functools import partial
import sys

# Function to multiprocess
def func(a, b, c, d):
    print(a * (b + 2 * c - d))
    sys.stdout.flush()

if __name__ == '__main__':
    p = Pool(2)
    # Now, I try to call func(a, b, c, d) for 10 different a values,
    # but the same b, c, d values passed in as keyword arguments
    a_iter = range(10)
    kwargs = {'b': 1, 'c': 2, 'd': 3}
    mapfunc = partial(func, **kwargs)
    p.map(mapfunc, a_iter)
The output is correct:
0
2
4
6
8
10
12
14
16
18
Is this the best practice (most "pythonic" way) to do so? I felt that:
1) Pool is commonly used;
2) keyword arguments are commonly used;
3) but combining them as in my example above feels a bit "hacky".
Using partial may be suboptimal if the bound arguments are large. The function passed to map is repeatedly pickled when sent to the workers (once for every argument in the iterable); a global Python function is (essentially) pickled by sending its qualified name (the same function is defined on the other side, so no data needs to be transferred), while a partial is pickled as the pickle of the underlying function plus all of the provided arguments.
If kwargs is all small primitives, as in your example, this doesn't really matter; the incremental cost of sending along the extra arguments is trivial. But if kwargs is big, say, kwargs = {'b': [1] * 10000, 'c': [2] * 20000, 'd': [3] * 30000}, that's a nasty price to pay.
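A quick way to see this cost is to measure how large the pickled partial actually is. The following is just an illustrative sketch (standard library only; it re-defines func with a return instead of a print so it can stand alone):

import pickle
from functools import partial

def func(a, b, c, d):
    return a * (b + 2 * c - d)

small = partial(func, b=1, c=2, d=3)
big = partial(func, b=[1] * 10000, c=[2] * 20000, d=[3] * 30000)

# The pickle of a partial carries the bound arguments along with a reference
# to the function, so the "big" version is orders of magnitude larger,
# and that payload is re-sent to the workers for the pool's tasks.
print(len(pickle.dumps(small)))
print(len(pickle.dumps(big)))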
In that case, you have some options:
Roll your own function at the global level that works like partial but pickles differently:

def func_a_only(a):
    return func(a, 1, 2, 3)
Using the initializer argument to Pool, so each worker process sets up state once instead of once per task; this also lets you ensure the data is available even in a spawn-based environment (e.g. Windows). See the sketch below this list.
Using Managers to share a single copy of the data among all processes
And probably a handful of other approaches. Point is, partial is fine for arguments that don't produce huge pickles, but it can kill you if the bound arguments are huge.
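For the initializer option, a minimal sketch might look like the following (the worker-side globals _b, _c, _d and the init_worker helper are names I've made up for illustration, not anything required by the API):

from multiprocessing import Pool

_b = _c = _d = None  # worker-side globals, populated once per worker

def init_worker(b, c, d):
    # Runs once in each worker process, so (possibly large) values are
    # sent once per worker instead of once per task.
    global _b, _c, _d
    _b, _c, _d = b, c, d

def func_a_only(a):
    return a * (_b + 2 * _c - _d)

if __name__ == '__main__':
    with Pool(2, initializer=init_worker, initargs=(1, 2, 3)) as p:
        print(p.map(func_a_only, range(10)))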
Note: In this particular case, if you're on Python 3.3+, you don't actually need partial, and avoiding the dict in favor of tuples saves a trivial amount of overhead. Without adding any new functions, just some imports, you could replace:
kwargs = {'b': 1, 'c': 2, 'd': 3}
mapfunc = partial(func, **kwargs)
p.map(mapfunc, a_iter)
with:
from itertools import repeat
p.starmap(func, zip(a_iter, repeat(1), repeat(2), repeat(3)))
to achieve a similar effect. To be clear, there is nothing wrong with partial that this "fixes" (both approaches would have the same problem with pickling large objects); this is just an alternate approach that is occasionally useful.
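In case the zip/repeat combination looks opaque: it just builds one positional-argument tuple per call, which starmap then unpacks. A tiny illustration (not part of the original code):

from itertools import repeat

args = list(zip(range(3), repeat(1), repeat(2), repeat(3)))
print(args)  # [(0, 1, 2, 3), (1, 1, 2, 3), (2, 1, 2, 3)]
# Pool.starmap then calls func(0, 1, 2, 3), func(1, 1, 2, 3), ...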