How to use a generator as an iterable with Multiprocessing map function

When I use a generator as the iterable argument with the multiprocessing.Pool.map function:

pool.map(func, iterable=(x for x in range(10)))

It seems that the generator is fully exhausted before func is ever called.

I want the generator to yield each item lazily, passing it to the worker processes one at a time instead of building the whole list up front. Thanks!

asked Jun 22 '17 by RustyShackleford


People also ask

Is a generator function iterable?

Generators are functions containing a yield keyword. Any function which has "yield" in it is a generator function. Calling a generator function creates an iterable. Since it is an iterable, it can be used with iter() and with a for loop.
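
A minimal sketch to illustrate (count_up_to is a made-up name):

def count_up_to(n):
    # Any function containing `yield` is a generator function.
    for i in range(n):
        yield i

gen = count_up_to(3)
print(iter(gen) is gen)   # True: a generator is its own iterator
for value in count_up_to(3):
    print(value)          # prints 0, 1, 2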

Is map function a generator?

map() returns a map object, which is an iterator that yields items on demand. So, the natural replacement for map() is a generator expression because generator expressions return generator objects, which are also iterators that yield items on demand.
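
For example, these two produce equivalent lazy iterators (a small sketch, not from the question):

squares_map = map(lambda x: x * x, range(5))   # map object, lazy
squares_gen = (x * x for x in range(5))        # generator, also lazy

print(next(squares_map))  # 0, computed only on demand
print(list(squares_gen))  # [0, 1, 4, 9, 16]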

What is the difference between pool and process in multiprocessing?

multiprocessing.Pool is generally used for heterogeneous tasks, whereas multiprocessing.Process is generally used for homogeneous tasks. The Pool is designed to execute heterogeneous tasks, that is, tasks that do not resemble each other. For example, each task submitted to the process pool may be a different target function.
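
A rough sketch of the two styles (download and parse are hypothetical stand-ins):

import multiprocessing as mp

def download(url):
    # Hypothetical task: pretend to fetch a URL.
    return "fetched " + url

def parse(doc):
    # A different hypothetical task: pretend to parse a document.
    return len(doc)

if __name__ == '__main__':
    # Pool: different (heterogeneous) tasks submitted to one pool of workers.
    with mp.Pool() as pool:
        a = pool.apply_async(download, ('http://example.com',))
        b = pool.apply_async(parse, ('<html></html>',))
        print(a.get(), b.get())

    # Process: a dedicated process running one target function.
    p = mp.Process(target=download, args=('http://example.com',))
    p.start()
    p.join()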

When would you use a multiprocessing pool?

Use the multiprocessing pool if your tasks are independent. This means that each task is not dependent on other tasks that could execute at the same time. It also may mean tasks that are not dependent on any data other than data provided via function arguments to the task.
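
As a sketch, independent tasks that depend only on their own arguments (crunch is a made-up example):

import multiprocessing as mp

def crunch(n):
    # Depends only on its argument, not on other tasks or shared state.
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with mp.Pool() as pool:
        print(pool.map(crunch, [10000, 20000, 30000]))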


2 Answers

Alas, this isn't well-defined. Here's a test case I'm running under Python 3.6.1:

import multiprocessing as mp

def e(i):
    if i % 1000000 == 0:
        print(i)

if __name__ == '__main__':
    p = mp.Pool()
    def g():
        for i in range(100000000):
            yield i
        print("generator done")
    r = p.map(e, g())
    p.close()
    p.join()

The first thing you see is the "generator done" message, and peak memory use is unreasonably high (precisely because, as you suspect, the generator is run to exhaustion before any work is passed out).

However, replace the map() call like so:

r = list(p.imap(e, g()))

Now memory use remains small, and "generator done" appears at the output end.

However, you won't wait long enough to see that, because it's horridly slow :-( imap() not only treats that iterable as an iterable, but effectively passes only 1 item at a time across process boundaries. To get speed back too, this works:

r = list(p.imap(e, g(), chunksize=10000))

In real life, I'm much more likely to iterate over an imap() (or imap_unordered()) result than to force it into a list, and then memory use remains small for looping over the results too.
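
That streaming pattern looks roughly like this (reusing e() and g() from the test case above, with g() moved to module level):

import multiprocessing as mp

def e(i):
    if i % 1000000 == 0:
        print(i)

def g():
    for i in range(100000000):
        yield i

if __name__ == '__main__':
    p = mp.Pool()
    # Results are consumed as they arrive, so memory stays bounded even
    # though 100 million items flow through the pool.
    for result in p.imap_unordered(e, g(), chunksize=10000):
        pass  # handle each result here
    p.close()
    p.join()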

answered Nov 15 '22 by Tim Peters


multiprocessing.Pool.map converts iterables without a __len__ method to a list before processing. This is done to aid the calculation of chunksize, which the pool uses to group worker arguments and reduce the round-trip cost of scheduling jobs. This is not optimal, especially when chunksize is 1, but since map must exhaust the iterator one way or the other, it's usually not a significant issue.

The relevant code is in pool.py. Notice its use of len:

def _map_async(self, func, iterable, mapper, chunksize=None, callback=None,
        error_callback=None):
    '''
    Helper function to implement map, starmap and their async counterparts.
    '''
    if self._state != RUN:
        raise ValueError("Pool not running")
    if not hasattr(iterable, '__len__'):
        iterable = list(iterable)

    if chunksize is None:
        chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
        if extra:
            chunksize += 1
    if len(iterable) == 0:
        chunksize = 0
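
A quick way to see which branch an argument takes (a small check, not part of pool.py):

gen = (x for x in range(10))
print(hasattr(gen, '__len__'))        # False: map() converts it to a list
print(hasattr(range(10), '__len__'))  # True: used as-is, len() works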
answered Nov 15 '22 by tdelaney