I am doing some parallel processing, as follows:
with mp.Pool(8) as tmpPool:
    results = tmpPool.starmap(my_function, inputs)
where inputs look like: [(1,0.2312),(5,0.52) ...] i.e., tuples of an int and a float.
The code runs fine, yet I can't seem to wrap a progress bar (tqdm) around it, as can be done with e.g. the imap method:
tqdm.tqdm(pool.imap(some_function, some_inputs))
Can this be done for starmap also?
Thanks!
It's not possible with starmap(), but it's possible with a patch adding Pool.istarmap(). It's based on the code for imap(). All you have to do is create the istarmap.py file and import the module to apply the patch before you make your regular multiprocessing imports.
Python <3.8
# istarmap.py for Python <3.8
import multiprocessing.pool as mpp


def istarmap(self, func, iterable, chunksize=1):
    """starmap-version of imap
    """
    if self._state != mpp.RUN:
        raise ValueError("Pool not running")
    if chunksize < 1:
        raise ValueError(
            "Chunksize must be 1+, not {0:n}".format(
                chunksize))
    task_batches = mpp.Pool._get_tasks(func, iterable, chunksize)
    result = mpp.IMapIterator(self._cache)
    self._taskqueue.put(
        (
            self._guarded_task_generation(result._job,
                                          mpp.starmapstar,
                                          task_batches),
            result._set_length
        ))
    return (item for chunk in result for item in chunk)


mpp.Pool.istarmap = istarmap
Python 3.8+
# istarmap.py for Python 3.8+
import multiprocessing.pool as mpp


def istarmap(self, func, iterable, chunksize=1):
    """starmap-version of imap
    """
    self._check_running()
    if chunksize < 1:
        raise ValueError(
            "Chunksize must be 1+, not {0:n}".format(
                chunksize))
    task_batches = mpp.Pool._get_tasks(func, iterable, chunksize)
    result = mpp.IMapIterator(self)
    self._taskqueue.put(
        (
            self._guarded_task_generation(result._job,
                                          mpp.starmapstar,
                                          task_batches),
            result._set_length
        ))
    return (item for chunk in result for item in chunk)


mpp.Pool.istarmap = istarmap
Then in your script:
import istarmap  # import to apply patch
from multiprocessing import Pool
import tqdm


def foo(a, b):
    for _ in range(int(50e6)):
        pass
    return a, b


if __name__ == '__main__':
    with Pool(4) as pool:
        iterable = [(i, 'x') for i in range(10)]
        for _ in tqdm.tqdm(pool.istarmap(foo, iterable),
                           total=len(iterable)):
            pass
The simplest way would probably be to apply tqdm() around the inputs, rather than the mapping function. For example:
inputs = zip(param1, param2, param3)
with mp.Pool(8) as pool:
    # zip() has no len(), so the total comes from one of the zipped lists
    results = pool.starmap(my_function, tqdm.tqdm(inputs, total=len(param1)))
As Darkonaut mentioned, as of this writing there's no istarmap natively available. If you want to avoid patching, you can add a simple *_star function as a workaround. (This solution was inspired by this tutorial.)
import tqdm
import multiprocessing


def my_function(arg1, arg2, arg3):
    return arg1 + arg2 + arg3


def my_function_star(args):
    # unpack the argument tuple so imap() can stand in for starmap()
    return my_function(*args)


jobs = 4
if __name__ == '__main__':
    with multiprocessing.Pool(jobs) as pool:
        args = [(i, i, i) for i in range(10000)]
        results = list(tqdm.tqdm(pool.imap(my_function_star, args), total=len(args)))
Some notes:
I also really like corey's answer. It's cleaner, though the progress bar does not appear to update as smoothly as with my answer. Note that corey's answer is several orders of magnitude faster than the code I posted above when chunksize=1 (the default). I'm guessing this is due to multiprocessing serialization, because increasing chunksize (or having a more expensive my_function) makes their runtimes comparable.
I went with my answer for my application, since my serialization/function-cost ratio was very low.