Python multiprocessing: max. number of Pool worker processes?

I am using Python's multiprocessing library and am wondering what the maximum number of worker processes I can spawn is.

E.g. I have defined pool = Pool(100), which would allow me to have at most 100 async processes running at the same time, but I have no clue what the real maximum value for this is.

Does anyone know how to find the maximum value for my Pool? I'm guessing it depends on the CPU or memory.

asked Feb 25 '14 by opstalj

2 Answers

This is not a complete answer, but the source can help guide us. When you pass maxtasksperchild to Pool, it saves the value as self._maxtasksperchild and only uses it when creating a worker object:

def _repopulate_pool(self):
    """Bring the number of pool processes up to the specified number,
    for use after reaping workers which have exited.
    """
    for i in range(self._processes - len(self._pool)):
        w = self.Process(target=worker,
                         args=(self._inqueue, self._outqueue,
                               self._initializer,
                               self._initargs, self._maxtasksperchild)
                        )

        ...

This worker object uses maxtasksperchild like so:

assert maxtasks is None or (type(maxtasks) == int and maxtasks > 0)

which wouldn't change the physical limit, and

while maxtasks is None or (maxtasks and completed < maxtasks):
    try:
        task = get()
    except (EOFError, IOError):
        debug('worker got EOFError or IOError -- exiting')
        break
    ...
    put((job, i, result))
    completed += 1

essentially saving the results from each task. While you could run into memory issues by storing too many results, you can hit the same problem by making a single list too large in the first place. In short, the source does not suggest a limit to the number of tasks, as long as the results can fit in memory.

Does this answer the question? Not entirely. However, on Ubuntu 12.04 with Python 2.7.5, the code below, while inadvisable, seems to run just fine for any large max_tasks value. Be warned that it seems to take exponentially longer to run for large values:

import multiprocessing, time
max_tasks = 10**3

def f(x):
    print x**2                          # Python 2 print statement
    time.sleep(5)                       # simulate some work per task
    return x**2

P = multiprocessing.Pool(max_tasks)     # one worker process per task -- deliberately oversized
for x in xrange(max_tasks):
    P.apply_async(f, args=(x,))
P.close()
P.join()
answered Oct 20 '22 by Hooked

You can use as many workers as you have memory for. That said, if you set up a pool without passing the processes argument, you'll get a number of workers equal to the machine's CPU count:

From the Pool docs:

processes is the number of worker processes to use. If processes is None then the number returned by os.cpu_count() is used.
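
As a quick illustrative sketch (Python 3; it peeks at the private _processes attribute, which is an implementation detail rather than a public API), you can see the default for yourself:

import multiprocessing
import os

print(os.cpu_count())               # number of CPUs the OS reports

pool = multiprocessing.Pool()       # no processes argument: sized from the CPU count
print(pool._processes)              # private attribute, shown only to illustrate the default
pool.close()
pool.join()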

If you're doing CPU-intensive work, I wouldn't want more workers in the pool than your CPU count. More workers would force the OS to context-switch your processes, which in turn lowers system performance. Even relying on hyperthreaded cores can, depending on your work, choke the processor.
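
For example, a rough sketch of sizing the pool for CPU-bound work might look like this (cpu_bound is just a stand-in for your real workload, not part of any library):

import multiprocessing
import os

def cpu_bound(n):
    # stand-in for real CPU-heavy work
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # cap the pool at the CPU count so workers aren't fighting for cores
    with multiprocessing.Pool(processes=os.cpu_count()) as pool:
        results = pool.map(cpu_bound, [10**6] * 8)
    print(len(results))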

On the other hand, if your task is like a webserver with many concurrent requests that individually don't max out the processor, go ahead and spawn as many workers as you have memory and/or IO capacity for.
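
A sketch of that situation, where the work is mostly waiting rather than computing (fetch and the worker count of 50 are arbitrary placeholders, not recommendations):

import multiprocessing
import time

def fetch(job):
    # placeholder: pretend to wait on the network or disk
    time.sleep(1)
    return job

if __name__ == "__main__":
    jobs = ["job-%d" % i for i in range(200)]
    # far more workers than cores is fine when each one mostly sleeps on IO
    with multiprocessing.Pool(processes=50) as pool:
        results = pool.map(fetch, jobs)
    print(len(results))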

maxtasksperchild is something different. This flag forces the pool to release all resources accumulated by a worker once the worker has been used/reused a certain number of times, replacing it with a fresh process.

If you imagine your workers read from disk and that work accumulates some per-worker overhead, maxtasksperchild will clear that overhead once a worker has completed that many tasks.
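
A minimal sketch of how the flag is passed (the pool size and task counts are arbitrary): each worker process is retired after it has completed 100 tasks, releasing whatever it had accumulated:

import multiprocessing

def task(x):
    # imagine per-worker caching or leaked resources accumulating here
    return x * 2

if __name__ == "__main__":
    # each worker is replaced by a fresh process after 100 tasks
    with multiprocessing.Pool(processes=4, maxtasksperchild=100) as pool:
        results = pool.map(task, range(1000))
    print(sum(results))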

answered Oct 20 '22 by Pebermynte Lars