
multiprocessing.Pool with a global variable

I am using the Pool class from Python's multiprocessing library to write a program that will run on an HPC cluster.

Here is an abstraction of what I am trying to do:

def myFunction(x):
    # myObject is a global variable in this case
    return myFunction2(x, myObject)

def myFunction2(x,myObject):
    myObject.modify() # here I am calling some method that changes myObject
    return myObject.f(x)

from multiprocessing import Pool

poolVar = Pool()
argsArray = [...]  # ARGS ARRAY GOES HERE (placeholder)
output = poolVar.map(myFunction, argsArray)

The function f(x) is contained in a *.so file, i.e., it is calling a C function.

The problem I am having is that the value of the output variable is different each time I run my program (even though the function myObject.f() is a deterministic function). (If I only have one process then the output variable is the same each time I run the program.)

I have tried creating the object rather than storing it as a global variable:

def myFunction(x):
    myObject = createObject()
    return myFunction2(x, myObject)

However, in my program the object creation is expensive, and thus, it is a lot easier to create myObject once and then modify it each time I call myFunction2(). Thus, I would like to not have to create the object each time.

Do you have any tips? I am very new to parallel programming so I could be going about this all wrong. I decided to use the Pool class since I wanted to start with something simple. But I am willing to try a better way of doing it.

Hugh Medal asked Sep 13 '13 04:09


1 Answer

"I am using the Pool class from python's multiprocessing library to do some shared memory processing on an HPC cluster."

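Processes are not threads! You cannot simply replace Thread with Process and expect everything to work the same. Processes do not share memory: each worker process gets its own copy of the global variables, so changes made in a worker are not seen by the original process or by the other workers. Because every worker modifies only its own copy of myObject, and which worker receives which inputs can vary from run to run, the output can vary as well.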

If you want to use shared memory between processes then you must use multiprocessing's data types, such as Value and Array, or use a Manager to create shared lists etc.
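As a rough illustration of Value and Array, here is a sketch with a hypothetical `worker` function: four child processes write into the same shared integer and the same shared array. The names and sizes are assumptions chosen for the example, not part of the question.

```python
from multiprocessing import Process, Value, Array


def worker(total, squares, i):
    """Write into shared memory from whichever process runs this."""
    squares[i] = i * i        # each caller writes its own slot
    with total.get_lock():    # += on a Value is not atomic without the lock
        total.value += i


def main():
    total = Value("i", 0)     # shared C int, initially 0
    squares = Array("i", 4)   # shared array of 4 C ints, zero-initialized
    procs = [Process(target=worker, args=(total, squares, i)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return total.value, squares[:]


if __name__ == "__main__":
    print(main())  # (6, [0, 1, 4, 9])
```

These shared objects are handed to the children through the Process arguments; they are meant to be inherited at process creation rather than sent as ordinary task arguments through Pool.map.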

In particular, you might be interested in the Manager.register method, which allows the Manager to create shared custom objects (although they must be picklable).
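A sketch of how registration can look, assuming a hypothetical `Counter` class and a custom `MyManager` subclass of `BaseManager` (both invented for illustration). The real object lives in the manager process; the children receive a proxy, and every method call on the proxy is a round trip to that process.

```python
import threading
from multiprocessing import Process
from multiprocessing.managers import BaseManager


class Counter:
    """Hypothetical custom object; the real instance lives in the manager process."""

    def __init__(self):
        self._n = 0
        self._lock = threading.Lock()  # the manager serves clients from threads

    def increment(self):
        with self._lock:
            self._n += 1

    def value(self):
        with self._lock:
            return self._n


class MyManager(BaseManager):
    """Custom manager class, so registration does not touch the stock managers."""


# Register a factory: manager.Counter() creates a Counter in the manager
# process and hands back a picklable proxy to it.
MyManager.register("Counter", Counter)


def bump(counter, times):
    """Call a method on the shared object; works on a proxy or a plain Counter."""
    for _ in range(times):
        counter.increment()


def main():
    with MyManager() as manager:
        shared = manager.Counter()
        procs = [Process(target=bump, args=(shared, 3)) for _ in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return shared.value()


if __name__ == "__main__":
    print(main())  # 6: both children updated the same object
```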

However, I'm not sure whether this will improve performance, since any communication between processes requires pickling, and pickling can take more time than simply instantiating the object.

Note that you can do some initialization of the worker processes by passing the initializer and initargs arguments when creating the Pool.

For example, in its simplest form, to create a global variable in the worker process:

def initializer():
    global data
    data = createObject()

Used as:

pool = Pool(4, initializer, ())

Then the worker functions can use the data global variable without worries.


Style note: never use the name of a built-in for your variables or modules (in your case, object is a built-in). Otherwise you'll end up with unexpected errors that can be obscure and hard to track down.

Bakuriu answered Oct 01 '22 02:10