I am making use of Python's multiprocessing library, and I am wondering: what is the maximum number of worker processes I can start?
E.g. I have defined an async pool = Pool(100), which would allow me to have at most 100 async processes running at the same time, but I have no clue what the real maximum value for this would be.
Does anyone know how to find the max value for my Pool? I'm guessing it depends on CPU or memory.
This is not a complete answer, but the source can help guide us. When you pass maxtasksperchild to Pool, it saves the value as self._maxtasksperchild and only uses it when creating a worker object:
def _repopulate_pool(self):
    """Bring the number of pool processes up to the specified number,
    for use after reaping workers which have exited.
    """
    for i in range(self._processes - len(self._pool)):
        w = self.Process(target=worker,
                         args=(self._inqueue, self._outqueue,
                               self._initializer,
                               self._initargs, self._maxtasksperchild)
                         )
        ...
This worker object uses maxtasksperchild like so:
assert maxtasks is None or (type(maxtasks) == int and maxtasks > 0)
which wouldn't change the physical limit, and
while maxtasks is None or (maxtasks and completed < maxtasks):
    try:
        task = get()
    except (EOFError, IOError):
        debug('worker got EOFError or IOError -- exiting')
        break
    ...
    put((job, i, result))
    completed += 1
essentially saving the result of each task. While you could run into memory issues by accumulating too many results, you can hit the same error simply by making a list that is too large in the first place. In short, the source does not suggest a limit to the number of tasks, as long as the results can fit in memory.
Does this answer the question? Not entirely. However, on Ubuntu 12.04 with Python 2.7.5, the code below, while inadvisable, seems to run just fine for any large max_tasks value. Be warned that the output appears to take exponentially longer to produce for large values:
import multiprocessing, time

max_tasks = 10**3

def f(x):
    print x**2
    time.sleep(5)
    return x**2

# One worker per task: far more processes than CPUs, but it still runs.
P = multiprocessing.Pool(max_tasks)
for x in xrange(max_tasks):
    P.apply_async(f, args=(x,))
P.close()
P.join()
You can use as many workers as you have memory for.
That being said, if you set up a pool without passing the processes argument, you'll get as many workers as the machine has CPUs. From the Pool docs:
processes is the number of worker processes to use. If processes is None then the number returned by os.cpu_count() is used.
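As a quick check, here is a minimal sketch of that default (Python 3 assumed; the square function is just an illustration):

import os
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # With processes left as None, the pool uses os.cpu_count() workers.
    print("CPUs reported by the OS:", os.cpu_count())
    with Pool() as pool:
        print(pool.map(square, range(8)))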
If you're doing CPU-intensive work, I wouldn't want more workers in the pool than your CPU count. More workers would force the OS to context-switch your processes in and out, which lowers overall performance. Depending on the workload, even relying on hyperthreaded cores can choke the processor.
On the other hand, if your task is like a web server handling many concurrent requests that individually don't max out the processor, go ahead and spawn as many workers as you have memory and/or I/O capacity for.
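For instance, here is a rough sketch of that I/O-bound case; the sleep just stands in for a blocking request, and the 50-worker figure is arbitrary:

import time
from multiprocessing import Pool

def fake_request(i):
    # Stand-in for an I/O-bound call: the worker mostly waits, using almost no CPU.
    time.sleep(0.5)
    return i

if __name__ == "__main__":
    # Far more workers than cores is fine here, since each one spends
    # its time blocked on "I/O" rather than computing.
    with Pool(processes=50) as pool:
        start = time.time()
        results = pool.map(fake_request, range(200))
        print(len(results), "tasks in", round(time.time() - start, 2), "s")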
maxtasksperchild is something different. This flag forces the pool to release all resources accumulated by a worker once the worker has been used/reused a certain number of times. If you imagine your workers reading from a disk, and this work has some setup overhead, maxtasksperchild will clear that overhead once a worker has done that many tasks.
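A small sketch of that recycling behaviour (the numbers are arbitrary; printing PIDs just makes the worker replacement visible):

import os
from multiprocessing import Pool

def task(i):
    # Return the worker's PID so we can see when the pool replaces a worker.
    return os.getpid()

if __name__ == "__main__":
    # With maxtasksperchild=5, each worker is retired after 5 tasks and a fresh
    # process takes its place, releasing whatever it had accumulated.
    with Pool(processes=2, maxtasksperchild=5) as pool:
        pids = pool.map(task, range(20), chunksize=1)
    print("distinct worker PIDs seen:", len(set(pids)))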