I'm interested in running a Python program on a computer cluster. In the past I have used Python MPI interfaces, but because these are difficult to compile and install, I would prefer solutions that use built-in modules, such as Python's multiprocessing module.
What I would really like to do is just set up a multiprocessing.Pool instance that would span across the whole computer cluster, and run a Pool.map(...). Is this something that is possible/easy to do?
If this is impossible, I'd like to at least be able to start Process instances on any of the nodes from a central script with different parameters for each node.
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads.
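The "remote" part refers to multiprocessing.managers, which can serve a shared object over TCP. Below is a minimal sketch of that idea, assuming a shared job queue; the names QueueManager, get_jobs, start_server, connect and the AUTHKEY value are made up for illustration. Note that this only covers communication: it does not launch processes on other nodes, so each node would still have to start its own worker script some other way.

```python
import queue
import threading
from multiprocessing.managers import BaseManager

AUTHKEY = b"change-me"   # hypothetical shared secret, must match on every node
_jobs = queue.Queue()    # the object that lives on the serving node


class QueueManager(BaseManager):
    """Manager that exposes one shared job queue over TCP."""


# Register a name that clients can call to get a proxy to the queue.
QueueManager.register("get_jobs", callable=lambda: _jobs)


def start_server(host="127.0.0.1", port=0):
    """Serve the queue in a background thread and return the bound address.

    Port 0 lets the OS pick a free port. On a real cluster, host would be
    an address that the other nodes can reach.
    """
    manager = QueueManager(address=(host, port), authkey=AUTHKEY)
    server = manager.get_server()
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.address


def connect(address):
    """Connect to a running server and return a proxy to its job queue."""
    manager = QueueManager(address=address, authkey=AUTHKEY)
    manager.connect()
    return manager.get_jobs()


if __name__ == "__main__":
    address = start_server()
    jobs = connect(address)      # on a cluster this would run on another node
    jobs.put({"node": 1, "param": 0.5})
    print(jobs.get())
```

Here the server and the client run in the same script only to keep the sketch self-contained; in practice connect() would be called from scripts on the other nodes, using the serving node's hostname and port.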
Python multiprocessing Pool can be used for parallel execution of a function across multiple input values, distributing the input data across processes (data parallelism).
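For reference, this is roughly what that looks like on a single machine. It is only a sketch; square and parallel_squares are placeholder names, and the pool size of 4 is arbitrary.

```python
from multiprocessing import Pool


def square(x):
    """Placeholder for the real per-item work."""
    return x * x


def parallel_squares(values, processes=4):
    """Apply square() to every value using a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # map() splits the inputs among the workers and returns the
        # results in the same order as the inputs.
        return pool.map(square, values)


if __name__ == "__main__":
    # The guard is needed on platforms that start workers by re-importing
    # the main module instead of forking.
    print(parallel_squares(range(10)))
```

All worker processes here are started on the machine that runs the script.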
Run on a Cluster: To start a Ray cluster, refer to Ray's cluster setup instructions. To connect a Pool (Ray's replacement for multiprocessing.Pool) to a running Ray cluster, specify the address of the head node in one of two ways: by setting the RAY_ADDRESS environment variable, or by passing the ray_address keyword argument to the Pool constructor.
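A hedged sketch of how that could be wired up, assuming Ray is installed on the nodes and a cluster is already running. The helper choose_pool and the function cube are hypothetical names, and the fallback to the standard library Pool when RAY_ADDRESS is unset is my own addition so the script also runs on a single machine.

```python
import os


def choose_pool(environ=os.environ):
    """Return (pool factory, keyword arguments) for the current environment.

    If RAY_ADDRESS is set, use Ray's Pool and pass the head node address
    through the ray_address keyword. Otherwise fall back to the standard
    library Pool, which only uses the local machine.
    """
    address = environ.get("RAY_ADDRESS")
    if address:
        from ray.util.multiprocessing import Pool  # requires Ray
        return Pool, {"ray_address": address}
    from multiprocessing import Pool
    return Pool, {}


def cube(x):
    """Placeholder for the real per-item work."""
    return x ** 3


if __name__ == "__main__":
    pool_factory, kwargs = choose_pool()
    pool = pool_factory(**kwargs)
    try:
        print(pool.map(cube, range(5)))
    finally:
        pool.close()
        pool.join()
```

Passing ray_address explicitly is redundant when RAY_ADDRESS is already set; it is done here only to show both options in one place.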
The multiprocessing Python module contains two classes capable of handling tasks. The Process class runs a single task in its own process, and the Pool class distributes a set of tasks over a fixed group of worker processes.
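As an illustration of the Process side, here is a small sketch that starts one Process per parameter set from a central script and collects the results through a Queue. The names work, worker and run_all are invented for the example, and everything still runs on one machine.

```python
from multiprocessing import Process, Queue


def work(scale):
    """Placeholder computation for one parameter value."""
    return scale * 2


def worker(name, scale, results):
    """Run the computation and report (name, result) back to the parent."""
    results.put((name, work(scale)))


def run_all(params):
    """Start one process per (name, scale) pair and gather all results."""
    results = Queue()
    procs = [Process(target=worker, args=(name, scale, results))
             for name, scale in params]
    for p in procs:
        p.start()
    # Drain the queue before joining so no child blocks on a full pipe.
    collected = dict(results.get() for _ in procs)
    for p in procs:
        p.join()
    return collected


if __name__ == "__main__":
    print(run_all([("node-a", 1), ("node-b", 2), ("node-c", 3)]))
```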
If by cluster computing you mean distributed memory systems (multiple nodes rather than SMP), then Python's multiprocessing may not be a suitable choice. It can spawn multiple processes, but they will still be bound within a single node.
What you will need is a framework that handles spawning of processes across multiple nodes and provides a mechanism for communication between the processes (pretty much what MPI does).
See the page on Parallel Processing on the Python wiki for a list of frameworks which will help with cluster computing.
From the list, pp, jug, pyro and celery look like sensible options, although I can't personally vouch for any of them since I have no experience with them (I mainly use MPI).
If ease of installation/use is important, I would start by exploring jug. It's easy to install, supports common batch cluster systems, and looks well documented.