
Python multiprocessing pool not chunking properly

Here's a simplified version of the code I'm working with: I have a Python class with an instance method that takes a list of strings, computes a result for each string, and combines the results before returning, like so:

class Foo(object):
    def do_task(self, stringList):
        results = []
        for s in stringList:
            results.append(computeResult(s))
        # combine results below...

Since the computations with the strings are all independent (and rather expensive), I'm trying to parallelize the operation with the Pool class in the multiprocessing module. I've thus defined a parallel version of do_task as follows (I'm currently just printing the separate results instead of combining them):

import math
from multiprocessing import Pool

# defined as another method on Foo
def do_task_parallel(self, stringList):
    numProcs = 2
    pool = Pool(processes=numProcs)
    chunksize = int(math.ceil(len(stringList) / float(numProcs)))
    for result in pool.imap(self.do_task, stringList, chunksize):
        print result
    pool.close()
    pool.join()

According to my understanding of how Pool works based on documentation and examples I've read, this should split my stringList iterable into chunks of roughly chunksize elements, each of which is submitted as a task to one of the processes in the pool. Thus, if I have a list stringList = ["foo1", "foo2", "foo3", "foo4"] split up amongst 2 processes (giving a chunksize of 2), pool should divide this into stringList1 = ["foo1", "foo2"] and stringList2 = ["foo3", "foo4"], which would be handled by the two processes in parallel.

However, when I create a Foo() object and call foo.do_task_parallel(stringList), it seems that pool is passing each element of my stringList separately to do_task (as a chunk of one). Not only does this fail to speed up my code, it makes it incorrect and actually slows it down: do_task then calls computeResult on each character of the single input string it receives on each of the four separate calls. I was expecting two calls, each handling an input list of size 2, not four calls each handling a single input string. I've checked, and chunksize is indeed 2. What am I doing wrong? If it helps, I'm running Python 2.7.3 on Windows 7 through Cygwin.
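A minimal way to see what the workers actually receive is to log each argument as it arrives. The sketch below uses multiprocessing.dummy (a thread-backed Pool with the same API) purely so it runs as-is; a process-based Pool hands arguments to the worker the same way:

```python
# Diagnostic sketch: record what the pool hands the worker on each call.
# multiprocessing.dummy is used here only so the snippet is self-contained;
# chunksize behaves identically with multiprocessing.Pool.
from multiprocessing.dummy import Pool

def do_task(arg):
    # Capture the type and value of each argument received.
    return (type(arg).__name__, arg)

stringList = ["foo1", "foo2", "foo3", "foo4"]
pool = Pool(processes=2)
calls = list(pool.imap(do_task, stringList, 2))
pool.close()
pool.join()

# Four calls, each receiving a single string -- chunksize=2 does not
# turn the input into two lists of two.
print(calls)
```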

Asked by Tom Swift
1 Answer

Your understanding is off ;-) chunksize is purely an optional optimization: it changes nothing about what's passed to the worker function; it only hints to the multiprocessing machinery about how many tasks to send over the internal inter-process pipes at a time.
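You can confirm this by running the same job with two very different chunksize values and comparing the results. This sketch uses multiprocessing.dummy (same API, thread-backed) so it is self-contained; a process Pool behaves identically:

```python
# chunksize only changes batching over the internal pipes, never the
# arguments or results. multiprocessing.dummy stands in for
# multiprocessing here so the sketch runs as-is.
from multiprocessing.dummy import Pool

def worker(s):
    return s.upper()

data = ["a", "b", "c", "d", "e"]
pool = Pool(2)
small = list(pool.imap(worker, data, 1))  # one task per pipe message
large = list(pool.imap(worker, data, 4))  # four tasks per pipe message
pool.close()
pool.join()

print(small == large)  # identical results either way
```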

If you want your worker function to be passed a list of strings, then you have to explicitly code that. For example, and sticking it on multiple lines for clarity:

chunks = [stringList[i: i+chunksize]
          for i in xrange(0, len(stringList), chunksize)]

for result in pool.imap(self.do_task, chunks):
    print result
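Putting the pieces together, here is one sketch of the whole class with explicit chunking, in Python 3 spelling (range rather than xrange, print as a function). computeResult is stubbed out since the original isn't shown, the combining step is assumed to be a sum, and multiprocessing.dummy stands in for multiprocessing so the example is self-contained:

```python
import math
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool

def computeResult(s):  # hypothetical stand-in for the real computation
    return len(s)

class Foo(object):
    def do_task(self, stringList):
        # Combining by summing is an assumption; the real combining
        # logic goes here.
        return sum(computeResult(s) for s in stringList)

    def do_task_parallel(self, stringList):
        numProcs = 2
        chunksize = int(math.ceil(len(stringList) / float(numProcs)))
        # Build the chunks explicitly -- this is what actually controls
        # what each do_task call receives.
        chunks = [stringList[i:i + chunksize]
                  for i in range(0, len(stringList), chunksize)]
        pool = Pool(processes=numProcs)
        results = list(pool.imap(self.do_task, chunks))
        pool.close()
        pool.join()
        return results

foo = Foo()
print(foo.do_task_parallel(["foo1", "foo2", "foo3", "foo4"]))  # [8, 8]
```

Note that with a real process-based Pool on Python 2, passing the bound method self.do_task would also fail to pickle; moving the worker to a plain function, or upgrading to Python 3, avoids that.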
Answered by Tim Peters