I am running a batch job with SLURM. The process I start in the jobfile is iterative. After each iteration, the program can be stopped gracefully by creating a file called `stop`. I would like such a stop command to be issued automatically one hour before the job is killed via the walltime limit.
You can have Slurm signal your job a configurable amount of time before the time limit is reached with the `--signal` option. From the `sbatch` man page:

> `--signal=[B:]<sig_num>[@<sig_time>]`
> When a job is within sig_time seconds of its end time, send it the signal sig_num. Due to the resolution of event handling by Slurm, the signal may be sent up to 60 seconds earlier than specified. sig_num may either be a signal number or name (e.g. "10" or "USR1"). sig_time must have an integer value between zero and 65535. By default, no signal is sent before the job's end time. If a sig_num is specified without any sig_time, the default time will be 60 seconds. Use the "B:" option to signal only the batch shell; none of the other processes will be signaled. By default all job steps will be signalled, but not the batch shell itself.
If you can modify your program to catch that signal to stop rather than looking for a file, then this is the best option.
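If the iterative loop itself is a shell script under your control, catching the signal amounts to setting a flag in the handler and checking it between iterations. A minimal sketch (the iteration count and the `sleep` placeholder are illustrative, not from the answer):

```shell
#!/bin/bash
# Sketch: an iterative driver that stops cleanly when Slurm delivers
# SIGUSR1, instead of polling for a stop file between iterations.
stop_requested=0
trap 'stop_requested=1' USR1   # bash runs the trap once the current command finishes

iteration=0
while [ "$stop_requested" -eq 0 ] && [ "$iteration" -lt 1000 ]; do
    iteration=$((iteration + 1))
    sleep 1                    # placeholder for one iteration of real work
done
echo "stopped after $iteration iterations"
```

Note that bash delivers the trap only after the current foreground command (here `sleep`) returns, so the loop exits at the next iteration boundary, which matches the "stop softly between iterations" behaviour described in the question.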
If you can't, add something like

```shell
trap "touch ./stop" SIGUSR1
```

in your submission script. With `--signal=B:SIGUSR1@3600`, this will make the script catch the SIGUSR1 signal and create the `stop` file one hour before the end of the allocation.
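Put together, a submission script could look like the following sketch (the walltime and program name are illustrative assumptions, not from the answer):

```shell
#!/bin/bash
#SBATCH --time=10:00:00          # example walltime, adjust to your job
#SBATCH --signal=B:SIGUSR1@3600  # signal the batch shell 1 h before the limit

# Create the stop file when Slurm signals the batch shell.
trap "touch ./stop" SIGUSR1

# Run the program in the background and wait for it: bash does not
# process traps while a foreground child is running, so the signal
# would otherwise be handled only after the program exits.
./my_iterative_program &         # hypothetical program name
wait                             # returns early when SIGUSR1 is caught
wait                             # wait again for the program to finish
```

The double `wait` matters: the first call is interrupted by the signal (which fires the trap and creates `stop`), and the second waits for the program to notice the file and exit before the batch script ends.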
Note that only recent versions of Slurm have the `B:` option in `--signal`. If your version does not have it, you'll need to set up a watchdog. See examples here.
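One simple form of such a watchdog, sketched here assuming a 10-hour walltime and a hypothetical program name, is a background subshell that sleeps for the walltime minus one hour and then creates the stop file:

```shell
#!/bin/bash
#SBATCH --time=10:00:00   # example walltime

# Watchdog: create ./stop one hour before the 10 h limit expires.
( sleep $(( 9 * 3600 )) && touch ./stop ) &
watchdog_pid=$!

./my_iterative_program    # hypothetical program name

# If the program finished early, kill the watchdog so it does not
# linger or drop a stale stop file.
kill "$watchdog_pid" 2>/dev/null
```

This is less precise than `--signal` (the timer starts when the script runs, not relative to the actual end time Slurm computes), but it works on any Slurm version.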