I suppose it's a pretty trivial question but nevertheless, I'm looking for the (sacct I guess) command that will display the CPU time and memory used by a slurm job ID.
If your job is finished, then the sacct command is what you're looking for. Otherwise, look into sstat. For sacct, the --format switch is the other key element. If you run this command:
sacct -e
you'll get a printout of the different fields that can be used for the --format switch. The details of each field are described in the Job Accounting Fields section of the man page. For CPU time and memory, CPUTime and MaxRSS are probably what you're looking for. CPUTimeRAW can also be used if you want the number in seconds, as opposed to the usual Slurm time format.
sacct --format="CPUTime,MaxRSS"
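For a specific finished job, you can combine the format list with -j (the job ID below is a placeholder; substitute your own):

```shell
# 123456 is a hypothetical job ID; replace it with yours.
sacct -j 123456 --format="JobID,JobName,Elapsed,CPUTime,CPUTimeRAW,MaxRSS"
```

This prints one row per job step (e.g. 123456.batch), which is why MaxRSS may appear only on the step rows.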
The other answers all detail output formats for sacct, which is great for looking at multiple jobs aggregated in a table.
However, sometimes you want to look at a specific job in more detail, so you can tell whether your job efficiently used the allocated resources. For that, seff is very useful. The syntax is simply seff <jobid>. For example, here's a recent job of mine (that failed):
$ seff 15780625
Job ID: 15780625
Cluster: mycluster
User/Group: myuser/mygroup
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 12:06:01
CPU Efficiency: 85.35% of 14:10:40 core-walltime
Job Wall-clock time: 00:53:10
Memory Utilized: 1.41 GB
Memory Efficiency: 70.47% of 2.00 GB
Note that the key CPU metric, CPU Utilized, corresponds to the TotalCPU field from sacct, while Memory Utilized corresponds to MaxRSS.
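To confirm that mapping, you can pull the same two fields straight from sacct for that job (reusing the job ID from the seff output above):

```shell
# The raw fields behind seff's "CPU Utilized" and "Memory Utilized".
sacct -j 15780625 --format="JobID,TotalCPU,MaxRSS"
```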
sacct is indeed the command to use for finished jobs. For running jobs, you can look at the sstat command.
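For a job that is still running, a minimal sstat invocation might look like this (the job ID is a placeholder; note that sstat reports per-step statistics, so you typically need to name a step such as .batch):

```shell
# 123456 is a hypothetical running job; .batch targets the batch step.
sstat -j 123456.batch --format="JobID,AveCPU,MaxRSS,MaxVMSize"
```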
@aaron.kizmiller is right, sacct is the command to use.
One can fetch any of the following fields by passing them to sacct --format="field,field":
Fields:
Account AdminComment AllocCPUS AllocGRES
AllocNodes AllocTRES AssocID AveCPU
AveCPUFreq AveDiskRead AveDiskWrite AvePages
AveRSS AveVMSize BlockID Cluster
Comment ConsumedEnergy ConsumedEnergyRaw CPUTime
CPUTimeRAW DerivedExitCode Elapsed ElapsedRaw
Eligible End ExitCode GID
Group JobID JobIDRaw JobName
Layout MaxDiskRead MaxDiskReadNode MaxDiskReadTask
MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask MaxPages
MaxPagesNode MaxPagesTask MaxRSS MaxRSSNode
MaxRSSTask MaxVMSize MaxVMSizeNode MaxVMSizeTask
McsLabel MinCPU MinCPUNode MinCPUTask
NCPUS NNodes NodeList NTasks
Priority Partition QOS QOSRAW
ReqCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov
ReqCPUS ReqGRES ReqMem ReqNodes
ReqTRES Reservation ReservationId Reserved
ResvCPU ResvCPURAW Start State
Submit Suspended SystemCPU Timelimit
TotalCPU UID User UserCPU
WCKey WCKeyID WorkDir
For example, to list all job ids, elapsed time, and max VM size, you can run:
sacct --format='JobID,Elapsed,MaxVMSize'
Although there already exist fantastic solutions, I'll share another perspective: this method can monitor many nodes in real time.
We can write a script monitor.sh that collects a statistic (memory, as an example) and logs it to a file.
#!/bin/sh
LOG="free.log_$(hostname)"
if [ -f "$LOG" ]; then
    echo "file existed, now deleting it!"
    rm "$LOG"
fi
echo "start recording!"
while true
do
    echo "******[$(date +%Y-%m-%d_%H:%M:%S)]******" >> "$LOG"
    # Print two samples one second apart; keep only the header and Mem: lines.
    free -s 1 -c 2 -h | sed -n 1,2p >> "$LOG"
done
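If you prefer the monitor to stop on its own rather than run until Slurm kills the step, a bounded variant is easy to sketch (the 3-sample count and 1-second interval are arbitrary choices; free and its -h flag assume Linux/procps):

```shell
#!/bin/sh
# Bounded variant of monitor.sh: take 3 samples, one per second, then exit.
LOG="free.log_$(hostname)"
: > "$LOG"            # start from an empty log
i=0
while [ "$i" -lt 3 ]; do
    echo "******[$(date +%Y-%m-%d_%H:%M:%S)]******" >> "$LOG"
    free -h | sed -n 1,2p >> "$LOG"   # header line plus the Mem: line
    i=$((i + 1))
    sleep 1
done
```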
Then write your job script sbatch_input.sh, which can be submitted with sbatch:
#! /bin/sh
#SBATCH -N 2
#SBATCH -p cnall
srun hostname
srun ./monitor.sh
Submit the script:
sbatch ./sbatch_input.sh
A free.log_<hostname> file is then generated for each allocated node.
You can export SACCT_FORMAT once and then just type sacct every time.
$ export SACCT_FORMAT="JobID%20,JobName,User,Partition,NodeList,Elapsed,CPUTime,State,AllocTRES%32"
$ sacct
JobID JobName User Partition NodeList Elapsed CPUTime State AllocTRES
-------------------- ---------- --------- ---------- --------------- ---------- ---------- ---------- --------------------------------
249527_4 xgb_tune zhaoqi cn cn12 00:26:50 1-11:46:40 RUNNING billing=80,cpu=80,mem=100G,node+
249527_1 xgb_tune zhaoqi cn cn09 00:26:50 1-11:46:40 RUNNING billing=80,cpu=80,mem=100G,node+
249527_2 xgb_tune zhaoqi cn cn10 00:26:50 1-11:46:40 RUNNING billing=80,cpu=80,mem=100G,node+
249527_3 xgb_tune zhaoqi cn cn11 00:26:50 1-11:46:40 RUNNING billing=80,cpu=80,mem=100G,node+
ref: https://docs.ycrc.yale.edu/clusters-at-yale/job-scheduling/resource-usage/
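To make the format stick across sessions, the export line can be appended to your shell startup file (assuming bash; adjust the file name for your shell):

```shell
# Append the export to ~/.bashrc so every new shell picks it up.
echo 'export SACCT_FORMAT="JobID%20,JobName,User,Partition,NodeList,Elapsed,CPUTime,State,AllocTRES%32"' >> ~/.bashrc
```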
sacct -a -j <job_id> --format=user%10,jobname%10,node%10,start%10,end%10,elapsed%10,MaxRSS
Use the sacct command to access finished Slurm job history.
Here <job_id> refers to the Slurm job ID, and --format= lists the different details to display, and in which format:
user: the user who ran the job
jobname: the job or process name
node: the machine on which the job ran
start and end: the job start and end dates, respectively
elapsed: the runtime of the job or process
MaxRSS: the maximum resident set size, i.e. the peak memory used by the job
%: determines how many characters are dedicated to printing a given piece of info (e.g. jobname%25: the job name will be displayed in 25 characters)