Multiprocessing Pool in Python - Only single CPU is utilized

Original Question

I am trying to use multiprocessing Pool in Python. This is my code:

import multiprocessing

def f(x):
    return x

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered

    for x in xrange(1, 11):
        res = list(mapper(f, bar(x)))   # bar() is defined elsewhere (not shown here)

This code makes use of all CPUs (I have 8 CPUs) when the xrange is small, like xrange(1, 6). However, when I increase the range to xrange(1, 10), I observe that only 1 CPU runs at 100% while the rest just idle. What could be the reason? Is it because the OS shuts down the CPUs due to overheating when I increase the range?

How can I resolve this problem?

Minimal, Complete, Verifiable Example

To replicate my problem, I have created this example: it's a simple problem of generating n-grams from a string.

#!/usr/bin/python

import time
import multiprocessing
import random


def f(x):
    # placeholder "work": just return the input unchanged
    return x

def ngrams(input_tmp, n):
    # generate all n-grams (as lists of tokens) from a whitespace-split string
    input = input_tmp.split()

    if n > len(input):
        n = len(input)

    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered

    num = 100000000  #100
    rand_list = random.sample(xrange(100000000), num)

    rand_str = ' '.join(str(i) for i in rand_list)

    for n in xrange(1, 100):
        res = list(mapper(f, ngrams(rand_str, n)))


if __name__ == '__main__':
    start = time.time()
    foo()
    print 'Total time taken: ' + str(time.time() - start)

When num is small (e.g., num = 10000), I find that all 8 CPUs are utilised. However, when num is substantially large (e.g., num = 100000000), only 2 CPUs are used and the rest are idling. This is my problem.

Caution: When num is too large it may crash your system/VM.

asked May 07 '15 by Maggie


1 Answer

First, ngrams itself takes a lot of time. While that's happening, it's obviously only using one core. But even when that finishes (which is very easy to test by just moving the ngrams call outside the mapper and throwing a print in before and after it), you're still only using one core. I get 1 core at 100% and the other cores all around 2%.
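
You can see that split for yourself with something like this (a rough sketch against the question's example, assuming ngrams, f, and rand_str are in scope):

import time
import multiprocessing

if __name__ == '__main__':
    p = multiprocessing.Pool()

    start = time.time()
    grams = ngrams(rand_str, 5)                 # runs serially, in the parent
    print 'ngrams took: ' + str(time.time() - start)

    start = time.time()
    res = list(p.imap_unordered(f, grams))      # the part that should be parallel
    print 'mapping took: ' + str(time.time() - start)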

If you try the same thing in Python 3.4, things are a little different—I still get 1 core at 100%, but the others are at 15-25%.

So, what's happening? Well, in multiprocessing, there's always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.

Here's how the overhead works: The main process has to pickle the values, then put them on a queue, then wait for values on another queue and unpickle them. Each child process waits on the first queue, unpickles values, does your do-nothing work, pickles the values, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most non-Windows platforms, I think an NT kernel mutex on Windows).

From what I can tell, your processes are spending over 99% of their time waiting on the queue or reading or writing it.

This isn't too unexpected, given that you have a large amount of data to process, and no computation at all beyond pickling and unpickling that data.
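
One way to convince yourself of this is to time the pickle round-trip for one n's worth of data against the "work" itself (a rough sketch, again assuming the names from the question's example; use a smaller num so it fits in memory):

import time
import cPickle as pickle   # on Python 3 this is just pickle

grams = ngrams(rand_str, 5)            # the full payload for one n

start = time.time()
pickle.loads(pickle.dumps(grams))      # roughly what multiprocessing does to move the data
print 'pickle round-trip: ' + str(time.time() - start)

start = time.time()
for g in grams:
    f(g)                               # the actual "work" on the same data
print 'work: ' + str(time.time() - start)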

If you look at the source for SimpleQueue in CPython 2.7, the pickling and unpickling happens with the lock held. So, pretty much all the work any of your background processes do happens with the lock held, meaning they all end up serialized on a single core.

But in CPython 3.4, the pickling and unpickling happens outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)

Still, even on 3.4, you're spending far more time waiting for access to the queue than doing anything, even the multiprocessing overhead. Which is why the cores only get up to 25%.

And of course you're spending orders of magnitude more time on the overhead than the actual work, which makes this not a great test, unless you're trying to test the maximum throughput you can get out of a particular multiprocessing implementation on your machine or something.

A few observations:

  • In your real code, if you can find a way to batch up larger tasks (explicitly—just relying on chunksize=1000 or the like here won't help), that would probably solve most of your problem (see the first sketch after this list).
  • If your giant array (or whatever) never actually changes, you may be able to pass it in the pool initializer, instead of in each task, which would pretty much eliminate the problem (see the second sketch after this list).
  • If it does change, but only from the main process side, it may be worth sharing rather than passing the data.
  • If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
  • Even if you need fully-contended shared memory with explicit locking, it may still be better than passing something this huge around.
  • It may be worth getting a backport of the 3.2+ version of multiprocessing or one of the third-party multiprocessing libraries off PyPI (or upgrading to Python 3.x), just to move the pickling out of the lock.
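
To expand on the batching point: the idea is that each task should carry enough real per-item work to amortize its IPC cost, instead of one trivial call per item. A minimal sketch (chunks and process_batch are illustrative names, not part of the question's code):

def chunks(seq, size):
    # yield successive slices of seq, each size items long
    for i in xrange(0, len(seq), size):
        yield seq[i:i + size]

def process_batch(batch):
    # do the real per-item work in the child, once per batch of items
    return [f(x) for x in batch]

def foo_batched(rand_str):
    p = multiprocessing.Pool()
    grams = ngrams(rand_str, 5)
    results = []
    for batch_result in p.imap_unordered(process_batch, chunks(grams, 10000)):
        results.extend(batch_result)
    return results

This only pays off when f does real work; with a do-nothing f, the same data still has to be pickled either way.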
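And for the initializer point, a sketch of handing the large, read-only string to each worker once, so individual tasks only pass a small argument (init_worker, _shared_str, and worker_task are illustrative names):

_shared_str = None   # set once per worker process

def init_worker(s):
    global _shared_str
    _shared_str = s

def worker_task(n):
    # only n and the (small) result cross the process boundary;
    # the big string is already in the worker
    return len(ngrams(_shared_str, n))

def foo_shared(rand_str):
    p = multiprocessing.Pool(initializer=init_worker, initargs=(rand_str,))
    return list(p.imap_unordered(worker_task, xrange(1, 100)))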
answered by abarnert