 

How to handle job cancelation in Slurm?

I am using the Slurm job manager on an HPC cluster. Sometimes a job is cancelled due to the time limit, and I would like to finish my program gracefully.

As far as I understand, the cancellation happens in two stages precisely so that a software developer has a chance to finish the program gracefully:

srun: Job step aborted: Waiting up to 62 seconds for job step to finish.                                                                                                                           
slurmstepd: error: *** JOB 18522559 ON ncm0317 CANCELLED AT 2020-12-14T19:42:43 DUE TO TIME LIMIT ***

You can see that I am given 62 seconds to finish the job the way I want (by saving some files, etc.).

Question: how do I do this? I understand that first some Unix signal is sent to my job and I need to respond to it correctly. However, I cannot find any information in the Slurm documentation on which signal it is. Besides, I do not know exactly how to handle it in Python; probably through exception handling?

asked Dec 16 '20 by Dmitry Kabanov


People also ask

How do I cancel multiple slurm jobs?

If you want to cancel all of your jobs, you can use scancel -u username, where username is your system username (i.e. jharri62 is my username). Often you may want to be selective and keep some jobs running, but cancel others, as sketched below.
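A minimal sketch of selective cancellation; the job name my_experiment and the job ID are hypothetical:

scancel -u $USER --state=PENDING   # cancel only your own pending jobs
scancel --name=my_experiment       # cancel all jobs submitted with a given job name
scancel 18522559                   # cancel a single job by its ID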

How do I cancel all pending jobs on slurm?

The job must be cancelled to release a resource allocation. To cancel a job, invoke scancel without the --signal option. This will first send a SIGCONT to all steps to eventually wake them up, followed by a SIGTERM, then wait the KillWait duration defined in slurm.conf and finally, if the steps have not terminated, send a SIGKILL.
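The sequence above is what plain scancel <jobid> does. If you only want to deliver a particular signal without cancelling the job, scancel also takes a --signal option; a small sketch, using the job ID from the log above:

scancel --signal=TERM 18522559   # send only SIGTERM to the job's steps; the job keeps running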

How do I cancel a job shown in squeue?

The tasks continue to run, but not under Slurm management. If you do kill/skill an srun job, you can use squeue to get the job ID and then either scancel the job, or use srun -p <partition> -a <jobid> -j to reattach srun to the job, after which you can use Ctrl-C to cancel it.

What does CG mean in slurm?

"CG" stands for "completing" and it happens to a job that cannot be terminated, probably because of an I/O operation. More detailed info in the Slurm Troubleshooting Guide.


2 Answers

In Slurm, you can decide which signal is sent at which moment before your job hits the time limit.

From the sbatch man page:

--signal=[[R][B]:]<sig_num>[@<sig_time>]
    When a job is within sig_time seconds of its end time, send it the signal sig_num.

So set

#SBATCH --signal=B:TERM@300

to get Slurm to signal the job with SIGTERM five minutes (300 seconds, since sig_time is given in seconds) before the allocation ends. Note that, depending on how you start your job, you might need to remove the B: part (see the sketch below).
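For instance, when the job is started through a batch script, the B: form delivers the signal to the batch shell only, which then has to pass it on to the Python process itself. A minimal sketch of such a submission script (the time limit and the overall structure are assumptions; myscript.py is the script name used further below):

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --signal=B:TERM@300

# Start the Python program in the background and remember its PID,
# so the batch shell can forward SIGTERM to it when the signal arrives.
python myscript.py &
PID=$!
trap 'kill -TERM "$PID"' TERM

wait "$PID"   # returns early when the trap fires
wait "$PID"   # wait again so the Python process can finish its cleanup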

In your Python script, use the signal package. You need to define a "signal handler", a function that will be called when the signal is received, and "register" that function for a specific signal. As that function disrupts the normal flow of the program when called, you need to keep it short and simple to avoid unwanted side effects, especially with multithreaded code.

A typical scheme in a Slurm environment is to have a script skeleton like this:

#!/usr/bin/env python

import signal, os, sys

# Global Boolean variable that indicates that a signal has been received
interrupted = False

# Global Boolean variable that indicates the natural end of the computations
converged = False

# Definition of the signal handler. All it does is flip the 'interrupted' variable
def signal_handler(signum, frame):
    global interrupted
    interrupted = True

# Register the signal handler
signal.signal(signal.SIGTERM, signal_handler)

try:
    # Try to recover a state file with the relevant variables stored
    # from a previous stop, if any
    with open('state', 'r') as file:
        state_vars = file.read()
except FileNotFoundError:
    # Otherwise bootstrap (start from scratch)
    state_vars = init_computation()

# do_computation_iteration() is expected to update 'state_vars' and
# set 'converged' once the computation has finished
while not interrupted and not converged:
    do_computation_iteration()

# Save the current state if we were interrupted, and exit with code 99
# so that the submission script can requeue the job
if interrupted:
    with open('state', 'w') as file:
        file.write(state_vars)
    sys.exit(99)
sys.exit(0)

This first tries to resume computations left by a previous run of the job, and otherwise bootstraps them. If the job is interrupted, it lets the current loop iteration finish properly and then saves the needed variables to disk before exiting with return code 99. If Slurm is configured for it, this allows the job to be requeued automatically for further iterations.

If Slurm is not configured for it, you can requeue the job manually in the submission script like this:

python myscript.py || scontrol requeue $SLURM_JOB_ID
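Note that this one-liner requeues the job on any non-zero exit status. If you want to requeue only when the script exited with the "interrupted" code 99 used in the skeleton above, a small sketch:

python myscript.py
if [ $? -eq 99 ]; then
    # 99 means "interrupted by the signal, state saved": requeue for another run
    scontrol requeue "$SLURM_JOB_ID"
fi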
answered Oct 17 '22 by damienfrancois


In most programming languages, Unix signals are captured using a callback. Python is no exception. To catch Unix signals using Python, just use the signal package.

For example, to gracefully exit:

import signal, sys

def terminate_signal(signalnum, frame):
    print('Terminate the process')
    # save results, whatever...
    sys.exit()

# register the callback for SIGTERM
signal.signal(signal.SIGTERM, terminate_signal)

while True:
    pass  # work
• List of possible signals: SIGTERM is the one used to "politely ask a program to terminate". You can try it out locally, as sketched below.
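To check the handler outside of Slurm, one way is to save the example above (the file name handler_test.py is hypothetical), start it in the background, and send the signal by hand:

python handler_test.py &
kill -TERM $!   # send SIGTERM to the background job; the handler prints its message and exits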
answered Oct 17 '22 by Ricardo Magalhães Cruz