 

How to handle job cancelation in Slurm?

I am using the Slurm job manager on an HPC cluster. Sometimes a job is cancelled due to the time limit, and I would like to finish my program gracefully.

As far as I understand, the cancellation happens in two stages precisely so that a software developer has a chance to finish the program gracefully:

srun: Job step aborted: Waiting up to 62 seconds for job step to finish.                                                                                                                           
slurmstepd: error: *** JOB 18522559 ON ncm0317 CANCELLED AT 2020-12-14T19:42:43 DUE TO TIME LIMIT ***

You can see that I am given 62 seconds to finish the job the way I want (by saving some files, etc.).

Question: how do I do this? I understand that first some Unix signal is sent to my job and I need to respond to it correctly. However, I cannot find any information in the Slurm documentation on which signal it is. Besides, I do not know exactly how to handle it in Python; probably through exception handling?

asked Dec 16 '20 by Dmitry Kabanov


People also ask

How do I cancel multiple slurm jobs?

If you want to cancel all of your jobs, you can use scancel -u username, where username is your system username (i.e. jharri62 is my username). Often you may want to be selective and keep some jobs running, but cancel others, as sketched below.
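A minimal sketch of selective cancellation; the job name my_experiment and the job ID are hypothetical:

scancel -u $USER --state=PENDING   # cancel only your own pending jobs
scancel --name=my_experiment       # cancel all jobs submitted with a given job name
scancel 18522559                   # cancel a single job by its ID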

How do I cancel all pending jobs on slurm?

The job must be cancelled to release a resource allocation. To cancel a job, invoke scancel without the --signal option. This will first send a SIGCONT to all steps to eventually wake them up, followed by a SIGTERM, then wait the KillWait duration defined in slurm.conf and finally, if the steps have not terminated, send a SIGKILL.
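The sequence above is what plain scancel <jobid> does. If you only want to deliver a particular signal without cancelling the job, scancel also takes a --signal option; a small sketch, using the job ID from the log above:

scancel --signal=TERM 18522559   # send only SIGTERM to the job's steps; the job keeps running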

How do I cancel a job shown in squeue?

The tasks continue to run, but not under Slurm management. If you do kill/skill an srun job, you can use squeue to get the job ID and then either scancel the job, or use srun -p <partition> -a <jobid> -j to reattach srun to the job, after which you can use Ctrl-C to cancel it.

What does CG mean in slurm?

"CG" stands for "completing" and it happens to a job that cannot be terminated, probably because of an I/O operation. More detailed info in the Slurm Troubleshooting Guide.


2 Answers

In Slurm, you can decide which signal is sent at which moment before your job hits the time limit.

From the sbatch man page:

--signal=[[R][B]:]<sig_num>[@<sig_time>]
    When a job is within sig_time seconds of its end time, send it the signal sig_num.

So set

#SBATCH --signal=B:TERM@300

to get Slurm to signal the job with SIGTERM five minutes (300 seconds, since sig_time is given in seconds) before the allocation ends. Note that, depending on how you start your job, you might need to remove the B: part (see the sketch below).
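For instance, when the job is started through a batch script, the B: form delivers the signal to the batch shell only, which then has to pass it on to the Python process itself. A minimal sketch of such a submission script (the time limit and the overall structure are assumptions; myscript.py is the script name used further below):

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --signal=B:TERM@300

# Start the Python program in the background and remember its PID,
# so the batch shell can forward SIGTERM to it when the signal arrives.
python myscript.py &
PID=$!
trap 'kill -TERM "$PID"' TERM

wait "$PID"   # returns early when the trap fires
wait "$PID"   # wait again so the Python process can finish its cleanup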

In your Python script, use the signal package. You need to define a "signal handler", a function that will be called when the signal is received, and "register" that function for a specific signal. As that function disrupts the normal flow of the program when called, you need to keep it short and simple to avoid unwanted side effects, especially with multithreaded code.

A typical scheme in a Slurm environment is to have a script skeleton like this:

#!/usr/bin/env python

import signal, os, sys

# Global Boolean variable that indicates that a signal has been received
interrupted = False

# Global Boolean variable that indicates the natural end of the computations
converged = False

# Definition of the signal handler. All it does is flip the 'interrupted' variable
def signal_handler(signum, frame):
    global interrupted
    interrupted = True

# Register the signal handler
signal.signal(signal.SIGTERM, signal_handler)

try:
    # Try to recover a state file with the relevant variables stored
    # from a previous stop, if any
    with open('state', 'r') as file:
        state_vars = file.read()
except FileNotFoundError:
    # Otherwise bootstrap (start from scratch)
    state_vars = init_computation()

# do_computation_iteration() is expected to update 'state_vars' and
# set 'converged' once the computation has finished
while not interrupted and not converged:
    do_computation_iteration()

# Save the current state if we were interrupted, and exit with code 99
# so that the submission script can requeue the job
if interrupted:
    with open('state', 'w') as file:
        file.write(state_vars)
    sys.exit(99)
sys.exit(0)

This first tries to resume computations left by a previous run of the job, and otherwise bootstraps them. If the job is interrupted, it lets the current loop iteration finish properly and then saves the needed variables to disk before exiting with return code 99. If Slurm is configured for it, this allows the job to be requeued automatically for further iterations.

If Slurm is not configured for it, you can requeue the job manually in the submission script like this:

python myscript.py || scontrol requeue $SLURM_JOB_ID
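Note that this one-liner requeues the job on any non-zero exit status. If you want to requeue only when the script exited with the "interrupted" code 99 used in the skeleton above, a small sketch:

python myscript.py
if [ $? -eq 99 ]; then
    # 99 means "interrupted by the signal, state saved": requeue for another run
    scontrol requeue "$SLURM_JOB_ID"
fi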
answered Oct 17 '22 by damienfrancois


In most programming languages, Unix signals are captured using a callback. Python is no exception. To catch Unix signals using Python, just use the signal package.

For example, to gracefully exit:

import signal, sys

def terminate_signal(signalnum, frame):
    print('Terminate the process')
    # save results, whatever...
    sys.exit()

# register the callback for SIGTERM
signal.signal(signal.SIGTERM, terminate_signal)

while True:
    pass  # work
• List of possible signals: SIGTERM is the one used to "politely ask a program to terminate". You can try it out locally, as sketched below.
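To check the handler outside of Slurm, one way is to save the example above (the file name handler_test.py is hypothetical), start it in the background, and send the signal by hand:

python handler_test.py &
kill -TERM $!   # send SIGTERM to the background job; the handler prints its message and exits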
answered Oct 17 '22 by Ricardo Magalhães Cruz