SLURM sacct shows 'batch' and 'extern' job names

Tags:

slurm

I have submitted a job to a SLURM queue, and the job has run and completed. When I check the completed jobs using the sacct command, I notice additional entries in the results that I did not expect:

       JobID                        JobName      State      NCPUS  Timelimit
5297048                                test  COMPLETED          1   00:10:00  
5297048.bat+                          batch  COMPLETED          1           
5297048.ext+                         extern  COMPLETED          1       

Can anyone explain what the 'batch' and 'extern' jobs are and what their purpose is? Why does the extern job always complete, even when the primary job fails?

I have attempted to search the documentation but have not found a satisfactory and complete answer.

EDIT: Here's the script I am submitting to produce the above sacct output:

#!/bin/bash
echo test_script > done.txt

With the following sbatch command:

sbatch -A BRIDGE-CORE-SL2-CPU --nodes=1 --ntasks=1 -p skylake --cpus-per-task 1 -J jobname -t 00:10:00 --output=./output.out --error=./error.err < test.sh
asked Sep 21 '18 by Parsa


People also ask

How do I see completed jobs in Slurm?

You can get statistics (accounting data) on completed jobs by passing either the job ID or username flags. Here, the command sacct -j 215578 is used to show statistics about the completed job. This shows information such as the partition your job executed on, the account, and the number of allocated CPUs per job step.
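For example (the job ID here is the one from the sentence above; the field list is just an illustration):

# Statistics for a single completed job
sacct -j 215578

# Completed jobs for the current user, with selected fields
sacct -u $USER --format=JobID,JobName,Partition,Account,AllocCPUS,State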

What is sacct?

The sacct command displays job accounting data stored in the job accounting log file or Slurm database in a variety of forms for your analysis. By default, it shows information on jobs, job steps, status, and exit codes.

What is walltime in Slurm?

The wall clock duration can be specified in minutes, or in the MM:SS, HH:MM:SS, or D-HH:MM:SS formats. In general, the maximum walltime for CPU jobs is 120 hours (5 days); for jobs submitted to the GPU queue, the maximum walltime is also 120 hours.
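For instance, a walltime of two and a half days can be requested in the D-HH:MM:SS format (the script name is a placeholder):

# Request 2 days and 12 hours of walltime
sbatch -t 2-12:00:00 myscript.sh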

What is a job step in Slurm?

A Slurm job is just a resource allocation. You can execute many job steps within that allocation, either in parallel or sequentially. Some jobs actually launch thousands of job steps this way. The job steps will be allocated nodes that are not already allocated to other job steps.
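As a sketch, here is an allocation that runs two job steps in parallel (the program names are hypothetical):

#!/bin/bash
#SBATCH --ntasks=2
# Each srun call creates one job step; '&' lets them run in parallel
srun --ntasks=1 ./step_a &
srun --ntasks=1 ./step_b &
wait    # return only after both steps have finished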


1 Answer

A Slurm job can contain multiple job steps, which are all accounted for separately (in terms of resource usage) by Slurm. Usually, these steps are created using srun/mpirun and are numbered starting from 0. But in addition to that, there are sometimes two special steps. For example, take the following job:

sbatch -n 4 --wrap="srun hostname; srun echo Hello World"

This resulted in the following sacct output:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
5163571            wrap     medium      admin          4  COMPLETED      0:0 
5163571.bat+      batch                 admin          4  COMPLETED      0:0 
5163571.ext+     extern                 admin          4  COMPLETED      0:0 
5163571.0      hostname                 admin          4  COMPLETED      0:0 
5163571.1          echo                 admin          4  COMPLETED      0:0 

The two srun calls created the steps 5163571.0 and 5163571.1. The 5163571.bat+ step accounts for the resources needed by the batch script itself (which in this case is just srun hostname; srun echo Hello World; --wrap just puts that into a file and adds #!/bin/sh).
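In other words, the sbatch --wrap call above is roughly equivalent to submitting this script (a sketch; the file sbatch actually generates may differ slightly):

#!/bin/sh
srun hostname; srun echo Hello World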

Many non-MPI programs do a lot of calculations in the batch step, so the resource usage is accounted there.

And now for 5163571.ext+: this step accounts for all resource usage by that job outside of Slurm's control. It only shows up if the PrologFlags parameter in slurm.conf includes contain.
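On clusters where this is enabled, slurm.conf contains something like the following (admin-side configuration, shown only as an illustrative sketch):

# slurm.conf: create an extern step for each job and adopt
# external processes (e.g. ssh sessions) into the job's cgroup
PrologFlags=Contain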

An example of a process that belongs to a Slurm job but is not directly controlled by Slurm is an ssh session. If you ssh into a node where one of your jobs is running, your session will be placed into the context of the job (and you will be limited to the job's resources by cgroups, if that is set up). All calculations you do in that ssh session will then be accounted for in the .extern job step.
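You can see this yourself by checking the CPU time recorded for the extern step after working in such an ssh session; a sketch, with the job ID as a placeholder:

# TotalCPU for the .extern step grows as the ssh session does work
sacct -j <jobid> --format=JobID,JobName,Elapsed,TotalCPU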

answered Sep 16 '22 by Marcus Boden