
MPI signal handling

When using mpirun, is it possible to catch signals (for example, the SIGINT generated by ^C) in the code being run?

For example, I'm running a parallelized Python code. I can catch KeyboardInterrupt when running python blah.py by itself, but not when running mpirun -np 1 python blah.py.

Does anyone have a suggestion? Even finding how to catch signals in a C or C++ compiled program would be a helpful start.

If I send a signal directly to the spawned Python processes, they handle it properly; however, signals sent to the parent orterun process (e.g. from exceeding wall time on a cluster, or pressing Ctrl-C in a terminal) kill everything immediately.

Seth Johnson, asked Nov 05 '22 20:11


2 Answers

I think it really depends on the MPI implementation and the job launcher.

  • In SLURM, I used sbatch --signal USR1@30 to send SIGUSR1 (whose signal number is 30, 10, or 16, depending on the platform) to the program launched by the srun commands, and the process received SIGUSR1 = 10.

  • For IBM Platform MPI, according to https://www.ibm.com/support/knowledgecenter/en/SSF4ZA_9.1.4/pmpi_guide/signal_propagation.html, SIGINT, SIGUSR1, and SIGUSR2 are passed through to the processes.

  • In MPICH, SIGUSR1 is used by the process manager for internal notification of abnormal failures. ref: http://lists.mpich.org/pipermail/discuss/2014-October/003242.html

  • Open MPI, on the other hand, will forward SIGUSR1 and SIGUSR2 from mpiexec to the other processes. ref: http://www.open-mpi.org/doc/v1.6/man1/mpirun.1.php#sect14

  • For Intel MPI, according to https://software.intel.com/en-us/mpi-developer-reference-linux-hydra-environment-variables, I_MPI_JOB_SIGNAL_PROPAGATION and I_MPI_JOB_TIMEOUT_SIGNAL can be set so that signals are sent to the processes.

Another thing worth noting: many Python scripts invoke other libraries or code through Cython, and if SIGUSR1 is caught by a sub-process, something unwanted might happen.
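
For the launchers above that do forward a signal, the Python side could look roughly like the sketch below. It is only an illustration under stated assumptions: the names received, install_handlers, stop_requested, and run are made up for this example, the code is Unix-only (SIGUSR1 does not exist on Windows), and it does nothing to make mpirun or srun forward a signal they would not otherwise forward.

```python
import signal

# Names of the signals seen so far; the handler only appends to this list.
received = []


def _handle(signum, frame):
    """Record the signal by name. Keep the work done in a handler minimal."""
    received.append(signal.Signals(signum).name)


def install_handlers(signals=(signal.SIGUSR1, signal.SIGTERM, signal.SIGINT)):
    """Install the same handler for each signal (must run in the main thread).

    Note: handling SIGINT here replaces Python's default KeyboardInterrupt.
    """
    for sig in signals:
        signal.signal(sig, _handle)


def stop_requested():
    """True once at least one handled signal has arrived."""
    return bool(received)


def run(work_units, unit=lambda: None):
    """Run up to work_units units, stopping early once a signal has arrived."""
    done = 0
    for _ in range(work_units):
        if stop_requested():
            break
        unit()  # hypothetical placeholder for one chunk of real work
        done += 1
    return done
```

Calling install_handlers() early in blah.py and checking stop_requested() between work units gives each rank a chance to write a checkpoint and exit cleanly. Whether a signal sent to the launcher ever reaches this handler still depends on the implementation, as listed above.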

CatDog, answered Nov 15 '22 06:11


If you use mpirun --nw, then mpirun itself should terminate as soon as it's started the subprocesses, instead of waiting for their termination; if that's acceptable then I believe your processes would be able to catch their own signals.

Alex Martelli, answered Nov 15 '22 04:11