I am using the Pool class from Python's multiprocessing library to write a program that will run on an HPC cluster.
Here is an abstraction of what I am trying to do:
def myFunction(x):
    # myObject is a global variable in this case
    return myFunction2(x, myObject)

def myFunction2(x, myObject):
    myObject.modify()  # here I am calling some method that changes myObject
    return myObject.f(x)

poolVar = Pool()
argsArray = [ARGS ARRAY GOES HERE]
output = poolVar.map(myFunction, argsArray)
The function f(x) is contained in a *.so file, i.e., it is calling a C function.
The problem I am having is that the value of the output variable is different each time I run my program (even though the function myObject.f() is a deterministic function). (If I only have one process then the output variable is the same each time I run the program.)
I have tried creating the object rather than storing it as a global variable:
def myFunction(x):
    myObject = createObject()
    return myFunction2(x, myObject)
However, in my program the object creation is expensive, and thus, it is a lot easier to create myObject once and then modify it each time I call myFunction2(). Thus, I would like to not have to create the object each time.
Do you have any tips? I am very new to parallel programming so I could be going about this all wrong. I decided to use the Pool class since I wanted to start with something simple. But I am willing to try a better way of doing it.
I am using the Pool class from python's multiprocessing library to do some shared memory processing on an HPC cluster.
Processes are not threads! You cannot simply replace Thread with Process and expect everything to work the same. Processes do not share memory, which means that global variables are copied into each process, hence their value in the original process doesn't change.
If you want to use shared memory between processes then you must use multiprocessing's data types, such as Value and Array, or use the Manager to create shared lists etc.
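As a rough sketch of the shared-memory types, assuming plain integers are enough for the data you want to share (the function add_one and the variable names are hypothetical), something like this lets several processes write into the same Value and Array:

```python
from multiprocessing import Array, Process, Value


def add_one(total, slots, index):
    """Update the shared counter and one slot of the shared array."""
    with total.get_lock():  # guard the read-modify-write on the shared int
        total.value += 1
    slots[index] = index * index


if __name__ == "__main__":
    total = Value("i", 0)  # shared C int, initially 0
    slots = Array("i", 4)  # shared array of 4 C ints, zero-initialised
    procs = [Process(target=add_one, args=(total, slots, i)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(total.value, slots[:])
```

Note that these types only hold simple C data, so they are probably not a drop-in replacement for an object that wraps a C library.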
In particular you might be interested in the Manager.register method, which allows the Manager to create shared custom objects (although they must be picklable).
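Here is a sketch of what registering a custom class could look like, using a BaseManager subclass. The Counter class, ObjectManager and worker are hypothetical names chosen for the example; your real object would take the place of Counter. The object lives in the manager's server process and the workers only hold proxies to it.

```python
import threading
from multiprocessing import Process
from multiprocessing.managers import BaseManager


class Counter:
    """Hypothetical custom object that lives in the manager's server process."""

    def __init__(self):
        self._n = 0
        # The lock protects the counter when several clients call in at once.
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self._n += 1
            return self._n

    def value(self):
        with self._lock:
            return self._n


class ObjectManager(BaseManager):
    """Manager subclass that only exists to hold the registration."""


# After this call, manager.Counter() creates a Counter in the server
# process and returns a proxy to it.
ObjectManager.register("Counter", Counter)


def worker(counter, times):
    for _ in range(times):
        counter.increment()  # through a proxy, each call is a round trip


if __name__ == "__main__":
    with ObjectManager() as manager:
        shared = manager.Counter()
        procs = [Process(target=worker, args=(shared, 50)) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(shared.value())
```

Every method call on the proxy goes through inter-process communication, so this is convenient but not fast.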
However, I'm not sure whether this will improve performance, since any communication between processes requires pickling, and pickling usually takes more time than simply instantiating the object.
Note that you can do some initialization of the worker processes by passing the initializer and initargs arguments when creating the Pool.
For example, in its simplest form, to create a global variable in the worker process:
def initializer():
    global data
    data = createObject()
Used as:
pool = Pool(4, initializer, ())
Then the worker functions can use the data global variable without worries.
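Putting the pieces together, a complete sketch might look like the following. ExpensiveObject is a hypothetical stand-in for whatever createObject() returns, and init_worker and work are names I made up; the point is only that the object is built once per worker process rather than once per task.

```python
from multiprocessing import Pool

data = None  # set once in each worker by init_worker()


class ExpensiveObject:
    """Hypothetical stand-in for the object returned by createObject()."""

    def __init__(self, scale):
        self.scale = scale

    def f(self, x):
        return self.scale * x


def init_worker(scale):
    """Runs once in every worker process when the pool starts."""
    global data
    data = ExpensiveObject(scale)


def work(x):
    # Uses the per-process object created by init_worker().
    return data.f(x)


if __name__ == "__main__":
    with Pool(4, initializer=init_worker, initargs=(10,)) as pool:
        print(pool.map(work, range(5)))  # [0, 10, 20, 30, 40]
```

If the worker function also modifies the object, as in the question, the results stay reproducible only when each call leaves the object in a state that does not depend on which tasks that worker handled before, for example by resetting it at the start of every call.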
Style note: Never use the name of a built-in for your variables/modules. In your case object is a built-in. Otherwise you'll end up with unexpected errors which may be obscure and hard to track down.