You can use the Slurm command srun to allocate an interactive job. This means you pass specific options to srun on the command line to tell Slurm what resources you need to run your job, such as the number of nodes, the amount of memory, and the amount of time.
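For example, a minimal sketch of an interactive allocation (the resource values here are arbitrary placeholders to adjust for your cluster):

    srun --nodes=1 --ntasks=1 --mem=4G --time=01:00:00 --pty bash

This asks Slurm for one task on one node with 4 GB of memory for one hour, and drops you into a shell on the allocated node once the resources are granted.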
sbatch submits a batch script to Slurm. The batch script may be given to sbatch through a file name on the command line, or if no file name is specified, sbatch will read in a script from standard input. The batch script may contain options preceded with "#SBATCH" before any executable commands in the script.
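For illustration, a minimal batch script might look like this (the job name, resource values, and program name are placeholders):

    #!/bin/bash
    #SBATCH --job-name=example
    #SBATCH --ntasks=1
    #SBATCH --mem=4G
    #SBATCH --time=01:00:00
    ./my_program

You would then submit it with sbatch job.sh (assuming the script is saved as job.sh).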
You need to use -w node0xx or --nodelist=node0xx. You should also provide the partition, otherwise you may get a "requested node not in this partition" error, because some nodes belong to several partitions (in my case we have a node that is in both the fat and fat_short partitions).
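For example (the node name is a placeholder; fat is the partition name from my setup):

    srun --partition=fat --nodelist=node012 --pty bash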
srun is the command used to run a process on the compute nodes in the cluster. You pass it a command (which could be a script); that command is run on a compute node, and srun returns when it completes. srun accepts many command-line options to specify the resources required by the command passed to it.
The documentation says
srun is used to submit a job for execution in real time
while
sbatch is used to submit a job script for later execution.
They both accept practically the same set of parameters. The main difference is that srun is interactive and blocking (you get the result in your terminal and you cannot write other commands until it is finished), while sbatch is batch processing and non-blocking (results are written to a file and you can submit other commands right away).
If you use srun in the background with the & sign, you remove the 'blocking' feature of srun, which becomes interactive but non-blocking. It is still interactive though, meaning that the output will clutter your terminal, and the srun processes remain linked to your terminal. If you disconnect, you will lose control over them, or they might be killed (depending on whether they use stdout or not, basically). And they will be killed if the machine from which you submit jobs is rebooted.
If you use sbatch, you submit your job and it is handled by Slurm; you can disconnect, kill your terminal, etc. with no consequence. Your job is no longer linked to a running process.
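A hedged illustration of the three behaviours (script and program names are placeholders):

    # blocking: output streams to your terminal until the job finishes
    srun --ntasks=1 ./my_program

    # non-blocking, but still tied to your terminal session
    srun --ntasks=1 ./my_program &

    # detached: handled entirely by Slurm, output written to a file
    sbatch job.sh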
What are some things that I can do with one that I cannot do with the other, and why?
A feature that is available to sbatch and not to srun is job arrays. As srun can be used within an sbatch script, there is nothing that you cannot do with sbatch.
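For example, a job array sketch (the array range and the script body are placeholders):

    #!/bin/bash
    #SBATCH --array=1-10
    # each array task sees its own index in SLURM_ARRAY_TASK_ID
    echo "Processing task $SLURM_ARRAY_TASK_ID"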
How are these related to each other, and how do they differ for srun vs sbatch?
All the parameters --ntasks, --nodes, --cpus-per-task, and --ntasks-per-node have the same meaning in both commands. That is true for nearly all parameters, with the notable exception of --exclusive.
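As an illustration, the same resource request can be expressed with either command (the values and names are placeholders):

    srun   --nodes=2 --ntasks=8 --cpus-per-task=4 ./my_program
    sbatch --nodes=2 --ntasks=8 --cpus-per-task=4 job.sh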
What is happening "under the hood" that causes this to be the case?
srun immediately executes the script on the remote host, while sbatch copies the script into internal storage and then uploads it to the compute node when the job starts. You can check this by modifying your submission script after it has been submitted; the changes will not be taken into account (see this).
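A quick way to convince yourself of this (file names are placeholders):

    sbatch job.sh                      # submit the script
    echo 'echo changed' >> job.sh      # edit it while the job is still pending
    # the job runs the copy taken at submission time, not the edited file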
How do they interact with each other, and what is the "canonical" use-case for each of them?
You typically use sbatch to submit a job and srun in the submission script to create job steps, as Slurm calls them. srun is used to launch the processes. If your program is a parallel MPI program, srun takes care of creating all the MPI processes. If not, srun will run your program as many times as specified by the --ntasks option. There are many use cases depending on whether your program is parallel or not, has a long running time or not, is composed of a single executable or not, etc. Unless otherwise specified, srun inherits by default the pertinent options of the sbatch or salloc under which it runs (from here).
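For example, a sketch of a submission script that creates job steps (program names and resource values are placeholders):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks=8

    # job step 1: an MPI program launched across all allocated tasks
    srun ./mpi_program

    # job step 2: a serial post-processing step run as a single task
    srun --ntasks=1 ./postprocess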
Specifically, would I ever use srun by itself?
Other than for small tests, no. A common use is srun --pty bash to get a shell on a compute node.
This doesn't actually fully answer the question, but here is some more information I found that may be helpful for someone in the future:
From a related thread I found with a similar question:
In a nutshell, sbatch and salloc allocate resources to the job, while srun launches parallel tasks across those resources. When invoked within a job allocation, srun will launch parallel tasks across some or all of the allocated resources. In that case, srun inherits by default the pertinent options of the sbatch or salloc which it runs under. You can then (usually) provide srun different options which will override what it receives by default. Each invocation of srun within a job is known as a job step.
srun can also be invoked outside of a job allocation. In that case, srun requests resources, and when those resources are granted, launches tasks across those resources as a single job and job step.
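A sketch of the two modes described above (option values and program names are placeholders):

    # outside an allocation: srun requests the resources itself,
    # then launches the tasks as a single job and job step
    srun --ntasks=4 ./my_program

    # inside an allocation: srun inherits the allocation and only creates a job step
    salloc --ntasks=4
    srun ./my_program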
There's a relatively new web page which goes into more detail regarding the -B and --exclusive options.
doc/html/cpu_management.shtml
Additional information from the SLURM FAQ page.
The srun command has two different modes of operation. First, if not run within an existing job (i.e. not within a Slurm job allocation created by salloc or sbatch), then it will create a job allocation and spawn an application. If run within an existing allocation, the srun command only spawns the application. For this question, we will only address the first mode of operation and compare creating a job allocation using the sbatch and srun commands.
The srun command is designed for interactive use, with someone monitoring the output. The output of the application is seen as output of the srun command, typically at the user's terminal. The sbatch command is designed to submit a script for later execution and its output is written to a file. Command options used in the job allocation are almost identical. The most noticeable difference in options is that the sbatch command supports the concept of job arrays, while srun does not. Another significant difference is in fault tolerance. Failures involving sbatch jobs typically result in the job being requeued and executed again, while failures involving srun typically result in an error message being generated with the expectation that the user will respond in an appropriate fashion.
Another relevant conversation here