I would like to know what happens if a node of an OpenMPI/MPICH2 cluster terminates. Is there some mechanism that tolerates this case and lets the execution continue?
Thanks for your answers Heinrich
Note that one feature that has existed since the MPI 1.x days is the ability to set an error handler; see, e.g.,
http://www.mpi-forum.org/docs/mpi-11-html/node148.html
As Mark notes, most of us just use MPI_ERRORS_ARE_FATAL (which is the default) because our algorithms are very state-heavy and can't easily be recovered (except through checkpointing, which most of us do anyway).
But that need not be the case; you can instead have the MPI functions return error codes (by setting the MPI_ERRORS_RETURN handler) and try to recover as best you can.
There are a few fault-tolerant MPI packages out there, such as http://icl.cs.utk.edu/ftmpi/ (which is kind of old and only implements MPI 1.2 functionality). More recently, http://osl.iu.edu/research/ft/cifts/ is one approach being put into OpenMPI as a separate project, and there is also an OS-level checkpoint/restart package, BLCR, which may be of interest.
The MPI-3 forum is discussing a standard fault-tolerance API for MPI, so the pace of such projects is accelerating.