 

MPI Internals: Communication implementation between processes

I am trying to figure out how the actual inter-process communication happens inside MPI communicators. I have 8 nodes, each with 12 cores (96 instances running). Each process is assigned a unique rank, and the processes are able to communicate with each other. So how do the processes get their unique ranks and manage to send the actual messages?

According to some slides, there is the Open Run-Time Environment (ORTE), which "Resides on machine from which processes are launched on that cell. (e.g., front-end of cluster). Responsible for launching processes on the cell. Monitoring cell health (nodes, processes). Reporting cell state to rest of universe. Routing communication between cells."

I have not managed to find any developer documentation and/or architecture papers on MPI implementations. Does anyone have an idea how the actual communication between MPI processes is implemented, i.e. how they manage to find each other and get assigned ranks? Is there a central MPI-internal routing process, or several (e.g., one per node)?

Thanks, David

asked May 10 '12 by davidlt
1 Answer

The mechanisms you are talking about are strictly implementation-dependent. MPI is a mid-level standard that sits on top of whatever communication mechanisms are provided by the hardware and the operating system.

ORTE is part of Open MPI, one of the generic MPI implementations in the wild today. There are also MPICH and MPICH2 and their variants (e.g. Intel MPI). Most supercomputer vendors provide their own MPI implementations (e.g. IBM provides a modified MPICH2 for Blue Gene/Q).

Open MPI is divided into multiple layers, and each layer's functionality is provided by a multitude of modules that are loaded dynamically. There is a scoring mechanism that selects the best module under the current conditions.
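To make the scoring idea concrete, here is a toy sketch of priority-based component selection, loosely modeled on Open MPI's Modular Component Architecture (MCA). All names and scores here are illustrative assumptions, not actual Open MPI APIs:

```python
# Toy sketch of priority-based component selection, loosely modeled on
# Open MPI's MCA. Names, scores, and the query protocol are illustrative
# assumptions, not real Open MPI interfaces.

class Component:
    def __init__(self, name, query):
        self.name = name
        self.query = query  # returns a priority score, or None if unusable

def select_component(components, environment):
    """Ask every component to score itself; pick the highest scorer."""
    best, best_score = None, -1
    for comp in components:
        score = comp.query(environment)
        if score is not None and score > best_score:
            best, best_score = comp, score
    return best

# Hypothetical byte-transfer components: each reports a priority only
# if its transport is usable in the given environment.
btl_components = [
    Component("tcp",    lambda env: 10),  # plain TCP: always usable, low priority
    Component("sm",     lambda env: 50 if env.get("same_node") else None),
    Component("openib", lambda env: 90 if env.get("infiniband") else None),
]

chosen = select_component(btl_components, {"same_node": True})
print(chosen.name)  # shared memory wins when both peers share a node
```

The real mechanism is richer (components can be opened, queried, and closed per communicator), but the core idea is the same: every available module reports a fitness score and the framework picks the winner.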

All MPI implementations provide a mechanism to do the so-called SPMD launch. Essentially, an MPI application is a specialised kind of SPMD (Single Program Multiple Data) program: many copies of a single executable are run, and message passing is employed as the mechanism for communication and coordination. It is the SPMD launcher that takes a list of execution nodes, launches the processes remotely, and establishes the association and communication scheme between them (in Open MPI this is called the MPI Universe). It is the launcher that creates the global MPI communicator MPI_COMM_WORLD and distributes the initial rank assignment, and it can provide options such as binding processes to CPU cores (very important on NUMA systems).

Once the processes are launched, some mechanism for identifying them is established (e.g. a mapping between rank and IP address/TCP port), though other addressing schemes might be employed. Open MPI, for example, launches remote processes using ssh or rsh, or via the mechanisms provided by various resource management systems (e.g. PBS/Torque, SLURM, Grid Engine, LSF, ...). Once the processes are up and their IP addresses and port numbers are recorded and broadcast in the Universe, the processes can find each other on other (faster) networks, e.g. InfiniBand, and establish communication routes over them.
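The launcher's job can be sketched in miniature: start N copies of the same worker, hand each one a unique rank plus the world size, and give every copy the address map of all its peers. This is a local simulation using Python's multiprocessing module, with fake host:port strings standing in for the recorded IP/port pairs; it is a conceptual sketch, not how mpirun/orterun is actually implemented:

```python
# Conceptual sketch of an SPMD launch: N identical workers, each told its
# rank, the world size, and where every peer lives. Hostnames and ports
# are fabricated placeholders.

import multiprocessing as mp

def worker(rank, world_size, address_map, results):
    # Every copy runs the same code; behavior branches on the rank,
    # just like an MPI program branching on the result of MPI_Comm_rank().
    peer = address_map[(rank + 1) % world_size]  # e.g. my right neighbor
    results.put((rank, world_size, peer))

def launch(world_size):
    # The launcher knows in advance where each process will live
    # (here: fake "nodeX:port" strings, 12 ranks per node).
    address_map = {r: f"node{r // 12}:{5000 + r}" for r in range(world_size)}
    results = mp.Queue()
    procs = [mp.Process(target=worker, args=(r, world_size, address_map, results))
             for r in range(world_size)]
    for p in procs:
        p.start()
    out = sorted(results.get() for _ in range(world_size))  # drain before join
    for p in procs:
        p.join()
    return out

if __name__ == "__main__":
    for rank, size, peer in launch(4):
        print(f"rank {rank} of {size}, right neighbor at {peer}")
```

In a real implementation the workers are started on remote machines and the address map is gathered and broadcast after startup rather than known in advance, but the rank assignment and peer-discovery flow is the same in spirit.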

Routing messages is usually not done by MPI itself but is left to the underlying communication network. MPI only takes care of constructing the messages and then passes them to the network to be delivered to their destination. For communication between processes that reside on the same node, shared memory is usually used.
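The same-node shared-memory path can be illustrated with a toy example: one process writes a message into a shared block, another reads it back. Real MPI implementations use lock-free ring buffers and careful synchronization; this sketch (using Python's multiprocessing.shared_memory, available since Python 3.8) only shows the basic idea of exchanging bytes through memory both processes can see:

```python
# Toy same-node message passing through shared memory. A real MPI
# shared-memory transport uses ring buffers and proper synchronization;
# here the sender simply finishes before the receiver reads.

from multiprocessing import Process, shared_memory

def sender(shm_name):
    shm = shared_memory.SharedMemory(name=shm_name)
    payload = b"hello from rank 0"
    shm.buf[0] = len(payload)               # 1-byte length header
    shm.buf[1:1 + len(payload)] = payload   # message body
    shm.close()

def receive(shm_name):
    shm = shared_memory.SharedMemory(name=shm_name)
    n = shm.buf[0]
    msg = bytes(shm.buf[1:1 + n])
    shm.close()
    return msg

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=64)
    p = Process(target=sender, args=(shm.name,))
    p.start()
    p.join()                    # sender completes before we read
    print(receive(shm.name))
    shm.close()
    shm.unlink()
```

The key point is that no network stack is involved: both processes map the same physical memory, so a same-node "send" is essentially a memory copy.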

If you are interested in the technical details, I'd recommend that you read the source code of Open MPI. You can find it on the project's website.

answered Oct 21 '22 by Hristo Iliev