
Open MPI/MPICH - What happens if a node terminates?

I would like to know what happens if a node of an Open MPI/MPICH2 cluster terminates. Is there some fault-tolerance mechanism that handles this case and lets execution continue?

Thanks for your answers, Heinrich

Erik asked Oct 13 '22 21:10


1 Answer

Note that MPI has let you set an error handler since the MPI 1.x days, e.g.:

http://www.mpi-forum.org/docs/mpi-11-html/node148.html

As Mark notes, most of us just use MPI_ERRORS_ARE_FATAL (which is the default) because our algorithms are very state-heavy and can't easily recover from a lost process (except through checkpointing, which most of us do anyway).

But that need not be the case; you can install MPI_ERRORS_RETURN so that the MPI functions return error codes instead of aborting, and then try to recover as best you can.

There are a few fault-tolerant MPI packages out there, such as FT-MPI (http://icl.cs.utk.edu/ftmpi/), which is kind of old and only implements MPI 1.2 functionality. More recently, http://osl.iu.edu/research/ft/cifts/ is one approach being put into Open MPI as a separate project, and there is also an OS-level checkpoint/restart package, BLCR, which may be of interest.
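BLCR checkpoints at the OS level, but the do-it-yourself checkpointing most of us rely on is much simpler. A rough sketch, assuming a POSIX system where `rename` replaces the target atomically; the `SimState` struct and both function names are hypothetical stand-ins for whatever your solver really needs to save:

```c
#include <stdio.h>

/* Hypothetical solver state; a real code would save far more. */
typedef struct {
    int step;
    double value;
} SimState;

/* Write the state to a temporary file, then rename it into place, so a
   process that dies mid-write does not clobber the previous checkpoint.
   Returns 0 on success, nonzero on failure. */
static int save_checkpoint(const char *path, const SimState *s)
{
    char tmp[512];
    if (snprintf(tmp, sizeof tmp, "%s.tmp", path) >= (int)sizeof tmp)
        return -1;                          /* path too long */
    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    size_t n = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || n != 1) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path);
}

/* Load a previous checkpoint. Returns 0 on success, -1 if no usable
   checkpoint exists (e.g. on the very first run). */
static int load_checkpoint(const char *path, SimState *s)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    size_t n = fread(s, sizeof *s, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}
```

In an MPI job each rank would typically write its own file (say, with the rank number in the name) every so many steps, and on restart call `load_checkpoint` first and resume from the saved step if it succeeds.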

The MPI Forum is discussing a standard fault-tolerance API for MPI-3, so the pace of such projects is accelerating.

Jonathan Dursi answered Oct 18 '22 01:10