 

multiprocessing.Pool spawning more processes than requested only on Google Cloud

I am using Python's multiprocessing.Pool class to distribute tasks among processes.

The simple case works as expected:

from multiprocessing import Pool

def evaluate(data):
    do_something(data)

pool = Pool(processes=N)
for task in tasks:
    pool.apply_async(evaluate, (task,))

N processes are spawned, and they continually work through the tasks that I pass into apply_async. Now, I have another case where I have many different, very complex objects, each of which needs to do computationally heavy work. I initially let each object create its own multiprocessing.Pool on demand at the time it was doing work, but I eventually ran into an OSError for having too many files open, even though I had assumed the pools would be garbage collected after use.
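
For reference, a pool's worker processes and their file descriptors can be released deterministically rather than left to garbage collection; a minimal sketch with a placeholder worker:

from multiprocessing import Pool

def evaluate(data):
    return data * 2  # placeholder for do_something()

if __name__ == '__main__':
    with Pool(processes=4) as pool:  # Pool is a context manager since Python 3.3
        results = [pool.apply_async(evaluate, (d,)) for d in range(10)]
        print([r.get() for r in results])
    # the with-block calls pool.terminate() on exit, so the worker
    # processes and their pipes/file descriptors are released promptly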

At any rate, I decided it would be preferable anyway for each of these complex objects to share the same Pool for computations:

from multiprocessing import Pool

def evaluate(data):
    do_something(data)

pool = Pool(processes=N)

class ComplexClass:

    def work(self):
        for task in tasks:
            self.pool.apply_async(evaluate, (task,))

objects = [ComplexClass() for _ in range(50)]

for obj in objects:
    obj.pool = pool


while True:
    for obj in objects:
        obj.work()

Now, when I run this on one of my computers (OS X, Python 3.4), it works just as expected. N processes are spawned, and each complex object distributes its tasks among them. However, when I run it on another machine (a Google Cloud instance running Ubuntu, Python 3.5), it spawns an enormous number of processes (>> N) and the entire program grinds to a halt due to contention.

If I check the pool for more information:

import random
random_object = random.choice(objects)
print(random_object.pool._processes)

>>> N

Everything looks correct. But it's clearly not. Any ideas what may be going on?

UPDATE

I added some additional logging. I set the pool size to 1 for simplicity. As each task completes inside the pool, I print current_process() from the multiprocessing module, as well as the worker's PID using os.getpid(). It results in something like this:

<ForkProcess(ForkPoolWorker-1, started daemon)>, PID: 5122
<ForkProcess(ForkPoolWorker-1, started daemon)>, PID: 5122
<ForkProcess(ForkPoolWorker-1, started daemon)>, PID: 5122
<ForkProcess(ForkPoolWorker-1, started daemon)>, PID: 5122
...

Again, looking at actual activity using htop, I'm seeing many processes (one per object sharing the multiprocessing pool) all consuming CPU cycles while this happens, resulting in so much OS contention that progress is very slow. 5122 appears to be the parent process.
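
For reference, the worker-side logging described above amounts to a sketch like this (do_something stands in for the question's real work):

import os
from multiprocessing import current_process

def evaluate(data):
    # report which pool worker is executing this task
    print('{}, PID: {}'.format(current_process(), os.getpid()))
    do_something(data)  # the question's actual work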

Asked Nov 16 '17 by TSM

1 Answer

1. Infinite Loop Implemented

If you implement an infinite loop, then it will run like an infinite loop. Your example (which, for other reasons, does not work at all) ...

while True:
    for obj in objects:
        obj.work()

2. Spawn or Fork Processes?

Even though your code above shows only snippets, you cannot expect the same results on Windows and macOS on the one hand and Linux on the other, because they use different start methods. Windows always spawns processes; Linux forks them by default (and macOS also forked by default until Python 3.8, when its default changed to spawn). If you rely on global variables that carry state, you will run into trouble when developing in one environment and running in the other.

Make sure not to use global stateful variables in your processes. Pass them explicitly, or get rid of them in another way. If consistent behaviour across platforms matters, you can also pin the start method, as sketched below.
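
A minimal sketch of pinning the start method explicitly (pool size arbitrary):

import multiprocessing as mp

if __name__ == '__main__':
    # 'spawn' is available on every platform; 'fork' is Unix-only and
    # is what Linux (and older macOS Pythons) use by default
    ctx = mp.get_context('spawn')
    pool = ctx.Pool(processes=4)
    # ... submit work ...
    pool.close()
    pool.join()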

3. Use a Program, not a Script

Write a program that has, at minimum, an if __name__ == '__main__': guard. You need this especially when you use multiprocessing. Instantiate your Pool in that namespace.
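
Putting these points together, a minimal sketch of the question's shared-pool setup restructured as a program (worker body and sizes are placeholders):

from multiprocessing import Pool

def evaluate(task):
    return task  # placeholder for the question's do_something()

class ComplexClass:
    def __init__(self, pool):
        self.pool = pool  # the shared pool is passed in explicitly

    def work(self, tasks):
        return [self.pool.apply_async(evaluate, (t,)) for t in tasks]

if __name__ == '__main__':
    # everything stateful lives under the guard, so a child process
    # re-importing this module never builds its own Pool or objects
    with Pool(processes=4) as pool:
        objects = [ComplexClass(pool) for _ in range(50)]
        for obj in objects:
            for result in obj.work(range(10)):
                result.get()  # wait for results instead of looping forever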

Answered Nov 15 '22 by thomas