Using the multiprocessing module for cluster computing

I'm interested in running a Python program using a computer cluster. I have in the past been using Python MPI interfaces, but due to difficulties in compiling/installing these, I would prefer solutions which use built-in modules, such as Python's multiprocessing module.

What I would really like to do is just set up a multiprocessing.Pool instance that would span across the whole computer cluster, and run a Pool.map(...). Is this something that is possible/easy to do?

If this is impossible, I'd like to at least be able to start Process instances on any of the nodes from a central script with different parameters for each node.

Asked by astrofrog, Mar 03 '11 14:03



1 Answer

If by cluster computing you mean distributed-memory systems (multiple nodes rather than SMP), then Python's multiprocessing may not be a suitable choice. It can spawn multiple processes, but they will still be bound to a single node.
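To illustrate the single-node limitation, here is a minimal sketch of the Pool.map(...) pattern from the question. The names square and run_local_pool are hypothetical, and the sketch assumes a Unix-like system because it requests the "fork" start method explicitly. Each task reports the PID of the worker that ran it; every one of those workers is a child process on the machine that launched the script.

```python
import multiprocessing as mp
import os


def square(x):
    """Toy task: return the result and the PID of the worker that ran it."""
    return x * x, os.getpid()


def run_local_pool(values, workers=4):
    """Map `square` over `values` using a pool of local worker processes."""
    # Assumption: the "fork" start method, which exists only on Unix-like
    # systems. It is requested explicitly so this helper behaves the same
    # regardless of the platform's default start method.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=workers) as pool:
        # The pool's workers are children of this process, so they all
        # run on the local host; nothing here reaches other nodes.
        return pool.map(square, values)


if __name__ == "__main__":
    results = run_local_pool(range(8))
    print([value for value, _ in results])
    print(len({pid for _, pid in results}), "local worker process(es) used")
```

Running this on one node of a cluster uses only that node's cores, however many other nodes are idle.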

What you will need is a framework that handles spawning processes across multiple nodes and provides a mechanism for communication between them (pretty much what MPI does).

See the page on Parallel Processing on the Python wiki for a list of frameworks which will help with cluster computing.

From that list, pp, jug, pyro, and celery look like sensible options, although I can't personally vouch for any of them, since I have no experience with them (I mainly use MPI).

If ease of installation/use is important, I would start by exploring jug. It's easy to install, supports common batch cluster systems, and looks well documented.

Answered by Shawn Chin, Sep 23 '22 23:09