I have submitted a job to a SLURM queue, and the job has run to completion. When I then check the completed job with the sacct command, I notice additional entries in the output that I did not expect:
JobID         JobName  State      NCPUS  Timelimit
5297048       test     COMPLETED      1   00:10:00
5297048.bat+  batch    COMPLETED      1
5297048.ext+  extern   COMPLETED      1
Can anyone explain what the 'batch' and 'extern' jobs are and what their purpose is? And why does the extern job always complete, even when the primary job fails?
I have attempted to search the documentation but have not found a satisfactory and complete answer.
EDIT: Here's the script I am submitting to produce the above sacct output:
#!/bin/bash
echo test_script > done.txt
With the following sbatch command:
sbatch -A BRIDGE-CORE-SL2-CPU --nodes=1 --ntasks=1 -p skylake --cpus-per-task 1 -J jobname -t 00:10:00 --output=./output.out --error=./error.err < test.sh
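For reference, output in that shape can be reproduced by selecting the fields explicitly (the field names here are assumed from the column headers above):

sacct -j 5297048 --format=JobID,JobName,State,NCPUS,Timelimit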
A Slurm job is just a resource allocation. You can execute many job steps within that allocation, either in parallel or sequentially. Some jobs actually launch thousands of job steps this way. The job steps will be allocated nodes that are not already allocated to other job steps.
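As a quick sketch of that idea (the program names are placeholders), a single allocation can run job steps sequentially or in parallel:

#!/bin/bash
#SBATCH --ntasks=4

# Two sequential job steps, each using the full allocation:
srun -n 4 ./preprocess
srun -n 4 ./solve

# Or two job steps running side by side, each on half the tasks:
srun -n 2 ./part_a &
srun -n 2 ./part_b &
wait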
A Slurm job contains multiple job steps, which are all accounted for (in terms of resource usage) separately by Slurm. Usually, these steps are created using srun/mpirun and enumerated starting from 0. But in addition to that, there are sometimes two special steps. For example, take the following job:
sbatch -n 4 --wrap="srun hostname; srun echo Hello World"
This resulted in the following sacct output:
JobID         JobName   Partition  Account  AllocCPUS  State      ExitCode
------------  --------  ---------  -------  ---------  ---------  --------
5163571       wrap      medium     admin            4  COMPLETED  0:0
5163571.bat+  batch                admin            4  COMPLETED  0:0
5163571.ext+  extern               admin            4  COMPLETED  0:0
5163571.0     hostname             admin            4  COMPLETED  0:0
5163571.1     echo                 admin            4  COMPLETED  0:0
The two srun calls created the steps 5163571.0 and 5163571.1. The step 5163571.bat+ accounts for the resources needed by the batch script, which in this case is just srun hostname; srun echo Hello World (--wrap simply puts that into a file and adds #!/bin/sh).
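In other words, the --wrap call above is roughly equivalent to writing this script (file name chosen for illustration):

#!/bin/sh
srun hostname
srun echo Hello World

and submitting it with sbatch -n 4 job.sh.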
Many non-MPI programs do a lot of calculations in the batch step, so the resource usage is accounted there.
And now for 5163571.ext+: this step accounts for all resource usage by the job outside of Slurm's control. It only shows up if PrologFlags=Contain is set in slurm.conf.
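Whether that flag is set on a given cluster can be checked with:

scontrol show config | grep PrologFlags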
An example of processes that belong to a Slurm job but are not directly controlled by Slurm are ssh sessions. If you ssh into a node where one of your jobs is running, your session will be placed into the context of that job (and, if cgroups are set up, you will be limited to the resources available to the job). All calculations you do in that ssh session will then be accounted for in the .extern job step.
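The usage accumulated that way could then be inspected with something along these lines (job ID reused from the example above; the exact fields are just a suggestion):

sacct -j 5163571.extern --format=JobID,MaxRSS,TotalCPU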