
End batch job before kill via walltime

I am running a batch job with SLURM. The process I start in the job file is iterative. After each iteration, the program can be stopped gracefully by creating a file called stop. I would like such a stop command to be issued automatically one hour before the job is killed by the walltime limit.

user1638145 asked Nov 07 '14 13:11



1 Answer

You can have Slurm send your job a signal a configurable amount of time before the time limit is reached, using the --signal option.

From the sbatch man page:

--signal=[B:]<sig_num>[@<sig_time>] When a job is within sig_time seconds of its end time, send it the signal sig_num. Due to the resolution of event handling by SLURM, the signal may be sent up to 60 seconds earlier than specified. sig_num may either be a signal number or name (e.g. "10" or "USR1"). sig_time must have integer value between zero and 65535. By default, no signal is sent before the job’s end time. If a sig_num is specified without any sig_time, the default time will be 60 seconds. Use the "B:" option to signal only the batch shell, none of the other processes will be signaled. By default all job steps will be signalled, but not the batch shell itself.

If you can modify your program to catch that signal and stop, rather than looking for a file, that is the best option.
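As a rough illustration of what catching the signal inside the program could look like, here is a minimal sketch written as a shell loop. The loop body, the stop_requested flag, and the iterations counter are hypothetical stand-ins for the real program. The guarded block that sends SIGUSR1 only exists so the sketch can be tried outside Slurm; inside a job, SLURM_JOB_ID is set and Slurm delivers the signal instead.

```shell
#!/bin/bash
# Hypothetical iterative program that finishes its current iteration
# and then exits once it has received SIGUSR1.

stop_requested=0
trap 'stop_requested=1' USR1   # the handler only records the request

# Local simulation only (assumption): outside a Slurm job, send the
# signal to ourselves after one second so the loop terminates.
if [ -z "${SLURM_JOB_ID:-}" ]; then
    ( sleep 1; kill -USR1 "$$" ) &
fi

iterations=0
while [ "$stop_requested" -eq 0 ]; do
    sleep 1                        # placeholder for one iteration of work
    iterations=$((iterations + 1))
done

echo "stopped cleanly after $iterations iteration(s)"
```

The handler runs only after the current iteration has finished, so the loop never stops in the middle of a step.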

If you can't, add something like

trap "touch ./stop" SIGUSR1

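in your submission script. With --signal=B:SIGUSR1@3600, the script will catch the SIGUSR1 signal and create the stop file one hour before the end of the allocation. Note that bash runs a trap handler only after the current foreground command has finished, so start the program in the background and use wait for it; otherwise the stop file will not be created until the program has already exited.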
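Putting the pieces together, a submission script might look like the sketch below. The --time value, the iterate function (a stand-in for the real program, which would normally be launched in its place), and the guarded block that fakes the signal outside Slurm are all assumptions made for illustration. The program runs in the background and the script waits for it, because bash defers trap handlers while a foreground command is running.

```shell
#!/bin/bash
#SBATCH --time=24:00:00            # assumed walltime, adjust as needed
#SBATCH --signal=B:USR1@3600       # signal the batch shell about 1 hour before the limit

# Remove a stale stop file so an earlier run does not end this one at once.
rm -f ./stop

# Create the stop file when the batch shell receives SIGUSR1.
trap 'touch ./stop' USR1

# Hypothetical stand-in for the real iterative program: it checks for
# ./stop after each iteration and exits once the file exists.
iterate() {
    while [ ! -e ./stop ]; do
        sleep 1                    # placeholder for one iteration of work
    done
}

# Run the program in the background so the trap can fire while it runs.
iterate &
pid=$!

# Local simulation only (assumption): outside a Slurm job, send the
# signal ourselves after one second. Inside a job, Slurm sends it.
if [ -z "${SLURM_JOB_ID:-}" ]; then
    ( sleep 1; kill -USR1 "$$" ) &
fi

# wait returns early when the trapped signal arrives, so keep waiting
# until the program has really exited.
while kill -0 "$pid" 2>/dev/null; do
    wait "$pid"
done

echo "program finished"
```

The loop around wait matters: the first wait is interrupted by the signal, the handler creates the stop file, and the next wait lets the program finish its current iteration before the script exits.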

Note that only recent versions of Slurm support the B: option in --signal. If your version does not have it, you'll need to set up a watchdog. See examples here.

damienfrancois answered Oct 19 '22 09:10
