When we use the mpi_send/mpi_recv functions, what actually happens? Is the communication done by value or by address of the variable we want to send and receive? For example, process 0 wants to send variable "a" to process 1: does process 0 send the value of "a" or the address of "a"? And what happens when we use derived datatypes for communication?
Quite a bit of magic happens behind the scenes.
First, there's the unexpected message queue. When the sender calls MPI_Send before the receiver has called MPI_Recv, MPI doesn't know where in the receiver's memory the message should go. Two things can happen at this point. If the message is short, it is copied to a temporary buffer at the receiver. When the receiver calls MPI_Recv, it first checks whether a matching message has already arrived and, if it has, copies the data to the final destination. If not, information about the target buffer is queued so that the MPI_Recv can be matched when the message arrives. It is possible to examine the unexpected queue with MPI_Probe.
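To make the short-message path concrete, here is a toy model in plain C. It is only a sketch and not how any particular MPI library is written: the names unexpected_queue, toy_eager_send and toy_recv are made up for illustration, matching is by tag only, and message ordering is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_MSGS 8
#define MAX_LEN  64

/* One short message parked in the receiver's temporary storage. */
typedef struct {
    int    used;
    int    tag;
    size_t len;
    char   data[MAX_LEN];
} parked_msg;

/* Hypothetical stand-in for the unexpected message queue. */
typedef struct {
    parked_msg slots[MAX_MSGS];
} unexpected_queue;

/* Sender side: copy the payload into a free slot at the receiver.
   Returns 0 on success, -1 if the message is too long or the queue is full. */
int toy_eager_send(unexpected_queue *q, int tag, const void *buf, size_t len)
{
    assert(q != NULL && buf != NULL);
    if (len > MAX_LEN) return -1;
    for (int i = 0; i < MAX_MSGS; i++) {
        parked_msg *m = &q->slots[i];
        if (!m->used) {
            m->used = 1;
            m->tag  = tag;
            m->len  = len;
            memcpy(m->data, buf, len);   /* the value is copied, not the address */
            return 0;
        }
    }
    return -1;
}

/* Receiver side: if a matching message has already arrived, copy it to its
   final destination and return its length. Returns -1 when nothing matches;
   a real library would then remember the posted receive instead. */
long toy_recv(unexpected_queue *q, int tag, void *dest, size_t cap)
{
    assert(q != NULL && dest != NULL);
    for (int i = 0; i < MAX_MSGS; i++) {
        parked_msg *m = &q->slots[i];
        if (m->used && m->tag == tag && m->len <= cap) {
            memcpy(dest, m->data, m->len);
            m->used = 0;
            return (long)m->len;
        }
    }
    return -1;
}
```

Because the send copies the payload, the sender can overwrite its own variable as soon as toy_eager_send returns, and the receiver still gets the original value.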
If the message is longer than some threshold, copying it would take too long. Instead, the sender and the receiver do a handshake with a rendezvous protocol of some sort to make sure the target is ready to receive the message before it is sent out. This is especially important with a high-speed network like InfiniBand.
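The size-based choice between the two paths can be sketched as below. The 4096-byte cutoff and the three-step handshake (request, reply, data) are assumptions picked for illustration; actual thresholds and handshakes differ between libraries and networks.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PROTO_EAGER, PROTO_RENDEZVOUS } protocol;

/* Hypothetical cutoff in bytes; real libraries choose their own values. */
#define TOY_EAGER_LIMIT 4096

/* Short messages are pushed immediately; long ones wait for the receiver. */
protocol choose_protocol(size_t nbytes)
{
    return nbytes <= TOY_EAGER_LIMIT ? PROTO_EAGER : PROTO_RENDEZVOUS;
}

/* Messages on the wire in this toy model.
   Eager: the data alone.
   Rendezvous (assumed three-step handshake): request, reply, then data. */
int wire_messages(protocol p)
{
    assert(p == PROTO_EAGER || p == PROTO_RENDEZVOUS);
    return p == PROTO_EAGER ? 1 : 3;
}
```

The extra round trip is what the rendezvous path pays in order to skip the intermediate copy of a large payload.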
If the communicating ranks are on the same machine, the data transfer usually happens through shared memory. Because MPI ranks are independent processes, they do not have access to each other's memory. Instead, the MPI processes on the same node set up a shared memory region and use it to transfer messages. So sending a message involves copying the data twice: the sender copies it into the shared buffer, and the receiver copies it out into its own address space. There is one exception. If the MPI library is configured to use a kernel module like KNEM, the message can be copied directly to the destination inside the OS kernel. However, such a copy incurs the cost of a system call, so copying through the kernel is usually only worth it for large messages. Specialized HPC operating systems like Catamount can change these rules.
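The two-copy path can be imitated with plain POSIX calls. This is a sketch under assumptions, not MPI code: shm_transfer is a made-up helper, a forked child plays the sender, an anonymous shared mapping plays the shared region, and waiting for the child to exit stands in for proper synchronization.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SHM_BYTES 256

/* Moves a string from a child process (the "sender") to the caller (the
   "receiver") through a shared mapping. Returns 0 on success, -1 on failure. */
int shm_transfer(const char *msg, char *out, size_t cap)
{
    assert(msg != NULL && out != NULL);
    size_t n = strlen(msg) + 1;
    if (n > SHM_BYTES || n > cap) return -1;

    char *shared = mmap(NULL, SHM_BYTES, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) return -1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(shared, SHM_BYTES);
        return -1;
    }
    if (pid == 0) {
        memcpy(shared, msg, n);          /* copy 1: sender -> shared region */
        _exit(0);
    }

    int status = 0;
    int ok = waitpid(pid, &status, 0) == pid
          && WIFEXITED(status) && WEXITSTATUS(status) == 0;
    if (ok) memcpy(out, shared, n);      /* copy 2: shared region -> receiver */
    munmap(shared, SHM_BYTES);
    return ok ? 0 : -1;
}
```

Only the shared mapping is visible to both processes; the child's other writes would never show up in the parent, which is why the data has to pass through that region.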
Collective operations can be implemented either in terms of send/receive, or can have a completely separate optimized implementation. It is common to have implementations of several algorithms for a collective operation. The MPI library decides at runtime which one to use for best performance depending on the size of the messages and the communicator.
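As an example of a collective built from point-to-point steps, here is a small simulation of a binomial-tree broadcast from rank 0. It is one well-known algorithm among several and not necessarily what any given library picks; toy_bcast_rounds is a hypothetical name, and each array assignment stands in for one send matched by one receive.

```c
#include <assert.h>
#include <stdlib.h>

/* Simulates a binomial-tree broadcast from rank 0 among nprocs ranks.
   Returns the number of communication rounds, or -1 if allocation fails. */
int toy_bcast_rounds(int nprocs)
{
    assert(nprocs >= 1);
    int *has_data = calloc((size_t)nprocs, sizeof *has_data);
    if (has_data == NULL) return -1;
    has_data[0] = 1;                     /* only the root starts with the data */

    int rounds = 0;
    for (int dist = 1; dist < nprocs; dist *= 2) {
        /* Every rank that already holds the data sends it dist ranks ahead. */
        for (int src = 0; src < dist && src + dist < nprocs; src++) {
            assert(has_data[src]);
            has_data[src + dist] = 1;    /* one send on src, one recv on src + dist */
        }
        rounds++;
    }

    for (int r = 0; r < nprocs; r++)
        assert(has_data[r]);             /* everyone has been reached */
    free(has_data);
    return rounds;
}
```

The number of rounds grows with the logarithm of the communicator size, whereas a root that sends to every rank in turn needs nprocs - 1 steps. Trade-offs like this are what a runtime choice between algorithms is based on.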
A good MPI implementation will try very hard to transfer a derived datatype without creating extra copies. It will figure out which regions of memory within a datatype are contiguous and copy them individually. However, in some cases MPI will fall back to using MPI_Pack behind the scenes to make the message contiguous, and then transfer and unpack it.
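The pack-and-unpack fallback can be pictured with a strided layout described by a count, a block length and a stride, the same three numbers MPI_Type_vector takes. The helpers pack_strided and unpack_strided below are hypothetical plain-C stand-ins for what happens behind the scenes, limited to doubles.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Gathers count blocks of blocklen doubles, spaced stride elements apart in
   src, into the contiguous buffer out. Returns the number of doubles packed. */
size_t pack_strided(const double *src, size_t count, size_t blocklen,
                    size_t stride, double *out)
{
    assert(src != NULL && out != NULL && blocklen <= stride);
    for (size_t i = 0; i < count; i++)
        memcpy(out + i * blocklen, src + i * stride, blocklen * sizeof *src);
    return count * blocklen;
}

/* Scatters a contiguous buffer back into the same strided layout in dst. */
void unpack_strided(const double *in, size_t count, size_t blocklen,
                    size_t stride, double *dst)
{
    assert(in != NULL && dst != NULL && blocklen <= stride);
    for (size_t i = 0; i < count; i++)
        memcpy(dst + i * stride, in + i * blocklen, blocklen * sizeof *in);
}
```

For a 3x4 row-major matrix, a column is count 3, block length 1, stride 4. Each block is already contiguous, so each one can be copied in a single memcpy, and larger blocks mean fewer and longer copies.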
As far as the application programmer need be concerned, these operations send and receive data, not addresses of data. MPI processes do not share an address space, so an address on process 0 is meaningless to an operation on process 1: if process 1 wants the data at an address on process 0, it has to get it from process 0. I don't think that the one-sided communication introduced in MPI-2 materially affects this situation.
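The same point can be shown without MPI at all, using two ordinary processes and POSIX pipes as a stand-in for the message channel. This is an analogy only: roundtrip_through_child is a made-up helper, and the child adds one to the received value just to prove it worked on its own copy.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sends the value of a to a child process, which stores it in its own
   variable b and sends back b + 1. Returns 0 on success, -1 on failure. */
int roundtrip_through_child(int a, int *reply)
{
    assert(reply != NULL);
    int to_child[2], to_parent[2];
    if (pipe(to_child) != 0) return -1;
    if (pipe(to_parent) != 0) {
        close(to_child[0]); close(to_child[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(to_child[0]);  close(to_child[1]);
        close(to_parent[0]); close(to_parent[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: a separate address space with its own variable b. */
        close(to_child[1]); close(to_parent[0]);
        int b = 0;
        int ok = read(to_child[0], &b, sizeof b) == (ssize_t)sizeof b;
        b += 1;
        ok = ok && write(to_parent[1], &b, sizeof b) == (ssize_t)sizeof b;
        _exit(ok ? 0 : 1);
    }

    /* Parent: the bytes of a travel through the pipe; &a is only used locally. */
    close(to_child[0]); close(to_parent[1]);
    int ok = write(to_child[1], &a, sizeof a) == (ssize_t)sizeof a
          && read(to_parent[0], reply, sizeof *reply) == (ssize_t)sizeof *reply;
    close(to_child[1]); close(to_parent[0]);
    waitpid(pid, NULL, 0);
    return ok ? 0 : -1;
}
```

The address &a is handed to write only so the call knows which bytes to copy, much like the buffer argument of a send. What arrives on the other side is the value.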
What goes on under the hood, the view from the developer of the MPI libraries, might be different and will certainly be implementation dependent. For example, if you are using a well-written MPI library on a shared-memory machine then yes, it might implement message passing simply by passing pointers to memory locations around the system. But this is a corner case, and not much seen these days.
mpi_send requires you to give the address of the memory holding the data to be sent. It will return only when it is safe for you to re-use that memory (non-blocking communications can avoid this).
Similarly, mpi_recv requires you to give the address of a buffer large enough to hold the data to be received. It will return only when the data have been received into that buffer.
How MPI does that is another matter, and you don't need to worry about it to write a working MPI program (though you may need to for writing an efficient one).