
Is it possible to pause currently running submission scripts in SLURM?

Tags: slurm

I have a few scripts that I submitted with sbatch to a cluster I have access to. I'd like to pause these running jobs without cancelling them, since they have already been running for some time.

Is there a way to hold/pause currently running jobs without losing the work they have already done?

I found on a website that a particular job can be paused with:

scontrol hold <jobid>

However, I'm still unsure how to make it work with job arrays.

Charlie Parker asked Oct 10 '16 18:10



1 Answer

I believe

scontrol suspend

does what you want. From the documentation:

suspend job_list

Suspend a running job. The job_list argument is a comma separated list of job IDs. Use the resume command to resume its execution. User processes must stop on receipt of SIGSTOP signal and resume upon receipt of SIGCONT for this operation to be effective. Not all architectures and configurations support job suspension. If a suspended job is requeued, it will be placed in a held state.
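
As a rough sketch of how this could be used in practice, the small wrapper below joins several job IDs into the comma-separated list that the documentation describes and passes it to scontrol. The function name job_ctl and the DRY_RUN switch are hypothetical helpers, not part of Slurm, and the job IDs in the usage lines are made up. As far as I know, Slurm's job array documentation also lets suspend and resume take either a whole array ID or a single task written as <jobid>_<taskid>, but check this on your installation. On many clusters suspend is restricted to administrators, so the command may be refused for ordinary users.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm): suspend or resume a set of jobs.
# Usage: job_ctl suspend|resume <jobid> [<jobid> ...]
# Set DRY_RUN=1 to print the scontrol command instead of running it.
job_ctl() {
  local action=${1:-}
  case "$action" in
    suspend|resume) ;;
    *) echo "usage: job_ctl suspend|resume <jobid>..." >&2; return 2 ;;
  esac
  shift
  if [ "$#" -eq 0 ]; then
    echo "job_ctl: no job IDs given" >&2
    return 2
  fi
  # scontrol expects a comma-separated job list, e.g. 101,102_3
  local IFS=,
  local job_list="$*"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "scontrol $action $job_list"
  else
    scontrol "$action" "$job_list"
  fi
}
```

Assuming 101 is a job array and 102 an ordinary job, `job_ctl suspend 101 102` would pause both, `job_ctl suspend 101_3` would pause only task 3 of the array, and `job_ctl resume 101 102` would continue them. Running with `DRY_RUN=1` first shows the exact scontrol command without touching any job.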

franjesus answered Sep 22 '22 16:09