Open MPI/MPICH - What happens if a node terminates?

Question

I would like to know what happens if a node of an Open MPI/MPICH2 cluster terminates. Is there some mechanism that tolerates this case and lets execution continue?

Thanks for your answers Heinrich

Jonathan Dursi · Accepted Answer

Note that since the MPI 1.x days you have been able to set an error handler, e.g.:

http://www.mpi-forum.org/docs/mpi-11-html/node148.html

As Mark notes, most of us just use MPI_ERRORS_ARE_FATAL (which is the default) because our algorithms are very state-heavy and can't easily be recovered (except through checkpointing, which most of us do anyway).

But that need not be the case; you can have the MPI functions return error codes (by installing MPI_ERRORS_RETURN as the error handler) and try to recover as best you can.

There are a few fault-tolerant MPI packages out there, such as FT-MPI (http://icl.cs.utk.edu/ftmpi/), which is fairly old and only implements MPI 1.2 functionality. More recently, http://osl.iu.edu/research/ft/cifts/ describes one approach being put into Open MPI as a separate project, and there is also an OS-level checkpoint/restart package, BLCR (Berkeley Lab Checkpoint/Restart), which may be of interest.
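For contrast with OS-level tools like BLCR, the application-level checkpointing mentioned earlier can be sketched in a few lines of C. This is only an illustration under simplifying assumptions: the State struct and the function names are hypothetical, the state is a plain struct with no pointers, and in an MPI job each rank would presumably write its own file.

```c
#include <stdio.h>

/* Hypothetical solver state; a real code would hold far more. */
typedef struct {
    int step;
    double value;
} State;

/* Write the state to a temporary file, then rename it over `path`, so a
   crash mid-write leaves the previous checkpoint intact (rename replaces
   the target atomically on POSIX systems). Returns 0 on success. */
int save_checkpoint(const char *path, const State *s)
{
    char tmp[512];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    size_t written = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || written != 1) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Read a previously saved state back. Returns 0 on success. */
int load_checkpoint(const char *path, State *s)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t got = fread(s, sizeof *s, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}
```

On restart, a program would call load_checkpoint first and resume from the stored step if it succeeds, or start from scratch otherwise.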

The MPI-3 forum is discussing a standard fault-tolerance API for MPI, so the pace of such projects is accelerating.

Tags: distributed-computing, cluster-computing, mpi, openmpi, mpich
