I suppose it's a pretty trivial question but nevertheless, I'm looking for the (sacct I guess) command that will display the CPU time and memory used by a slurm job ID.
If your job is finished, then the sacct command is what you're looking for. Otherwise, look into sstat. For sacct, the --format switch is the other key element. If you run this command:
sacct -e
you'll get a printout of the different fields that can be used for the --format switch. The details of each field are described in the Job Accounting Fields section of the man page. For CPU time and memory, CPUTime and MaxRSS are probably what you're looking for. CPUTimeRAW can also be used if you want the number in seconds, as opposed to the usual Slurm time format.
sacct --format="CPUTime,MaxRSS"
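For a specific finished job, you can combine the format list with -j (the job ID below is a placeholder; substitute your own):

```shell
# 123456 is a hypothetical job ID; replace it with yours.
sacct -j 123456 --format="JobID,JobName,Elapsed,CPUTime,CPUTimeRAW,MaxRSS"
```

This prints one row per job step (e.g. 123456.batch), which is why MaxRSS may appear only on the step rows.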
The other answers all detail output formats for sacct, which is great for looking at multiple jobs aggregated in a table.
However, sometimes you want to look at a specific job in more detail, so you can tell whether your job efficiently used the allocated resources. For that, seff is very useful. The syntax is simply seff <jobid>. For example, here's a recent job of mine (that failed):
$ seff 15780625
Job ID: 15780625
Cluster: mycluster
User/Group: myuser/mygroup
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 12:06:01
CPU Efficiency: 85.35% of 14:10:40 core-walltime
Job Wall-clock time: 00:53:10
Memory Utilized: 1.41 GB
Memory Efficiency: 70.47% of 2.00 GB
Note that the key CPU metric, CPU Utilized, corresponds to the TotalCPU field from sacct, while Memory Utilized corresponds to MaxRSS.
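To confirm that mapping, you can pull the same two fields straight from sacct for that job (reusing the job ID from the seff output above):

```shell
# The raw fields behind seff's "CPU Utilized" and "Memory Utilized".
sacct -j 15780625 --format="JobID,TotalCPU,MaxRSS"
```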
sacct is indeed the command to use for finished jobs. For running jobs, you can look at the sstat command.
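For a job that is still running, a minimal sstat invocation might look like this (the job ID is a placeholder; note that sstat reports per-step statistics, so you typically need to name a step such as .batch):

```shell
# 123456 is a hypothetical running job; .batch targets the batch step.
sstat -j 123456.batch --format="JobID,AveCPU,MaxRSS,MaxVMSize"
```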
@aaron.kizmiller is right, sacct is the command to use.
One can fetch any of the following fields by passing them to sacct --format="field,field":
Fields:
Account AdminComment AllocCPUS AllocGRES
AllocNodes AllocTRES AssocID AveCPU
AveCPUFreq AveDiskRead AveDiskWrite AvePages
AveRSS AveVMSize BlockID Cluster
Comment ConsumedEnergy ConsumedEnergyRaw CPUTime
CPUTimeRAW DerivedExitCode Elapsed ElapsedRaw
Eligible End ExitCode GID
Group JobID JobIDRaw JobName
Layout MaxDiskRead MaxDiskReadNode MaxDiskReadTask
MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask MaxPages
MaxPagesNode MaxPagesTask MaxRSS MaxRSSNode
MaxRSSTask MaxVMSize MaxVMSizeNode MaxVMSizeTask
McsLabel MinCPU MinCPUNode MinCPUTask
NCPUS NNodes NodeList NTasks
Priority Partition QOS QOSRAW
ReqCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov
ReqCPUS ReqGRES ReqMem ReqNodes
ReqTRES Reservation ReservationId Reserved
ResvCPU ResvCPURAW Start State
Submit Suspended SystemCPU Timelimit
TotalCPU UID User UserCPU
WCKey WCKeyID WorkDir
For example, to list all job ids, elapsed time, and max VM size, you can run:
sacct --format='JobID,Elapsed,MaxVMSize'
Although there already exist fantastic solutions, I'll share another perspective: this method can monitor many nodes in real time.
We can write a script monitor.sh that collects a statistic (memory, as an example) and logs it to a file.
#!/bin/sh
LOG="free.log_$(hostname)"
if [ -f "$LOG" ]; then
    echo "file existed, now deleting it!"
    rm "$LOG"
fi
echo "start recording!"
while true
do
    echo "******[$(date +%Y-%m-%d_%H:%M:%S)]******" >> "$LOG"
    # Print two samples one second apart; keep only the header and Mem: lines.
    free -s 1 -c 2 -h | sed -n 1,2p >> "$LOG"
done
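If you prefer the monitor to stop on its own rather than run until Slurm kills the step, a bounded variant is easy to sketch (the 3-sample count and 1-second interval are arbitrary choices; free and its -h flag assume Linux/procps):

```shell
#!/bin/sh
# Bounded variant of monitor.sh: take 3 samples, one per second, then exit.
LOG="free.log_$(hostname)"
: > "$LOG"            # start from an empty log
i=0
while [ "$i" -lt 3 ]; do
    echo "******[$(date +%Y-%m-%d_%H:%M:%S)]******" >> "$LOG"
    free -h | sed -n 1,2p >> "$LOG"   # header line plus the Mem: line
    i=$((i + 1))
    sleep 1
done
```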
Then write your job script sbatch_input.sh, which can be submitted with sbatch:
#! /bin/sh
#SBATCH -N 2
#SBATCH -p cnall
srun hostname
srun ./monitor.sh
Submit the script:
sbatch ./sbatch_input.sh
A free.log_<hostname> file is then generated for each allocated node.
You can export SACCT_FORMAT once and then just type sacct every time.
$ export SACCT_FORMAT="JobID%20,JobName,User,Partition,NodeList,Elapsed,CPUTime,State,AllocTRES%32"
$ sacct
JobID JobName User Partition NodeList Elapsed CPUTime State AllocTRES
-------------------- ---------- --------- ---------- --------------- ---------- ---------- ---------- --------------------------------
249527_4 xgb_tune zhaoqi cn cn12 00:26:50 1-11:46:40 RUNNING billing=80,cpu=80,mem=100G,node+
249527_1 xgb_tune zhaoqi cn cn09 00:26:50 1-11:46:40 RUNNING billing=80,cpu=80,mem=100G,node+
249527_2 xgb_tune zhaoqi cn cn10 00:26:50 1-11:46:40 RUNNING billing=80,cpu=80,mem=100G,node+
249527_3 xgb_tune zhaoqi cn cn11 00:26:50 1-11:46:40 RUNNING billing=80,cpu=80,mem=100G,node+
ref: https://docs.ycrc.yale.edu/clusters-at-yale/job-scheduling/resource-usage/
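To make the format stick across sessions, the export line can be appended to your shell startup file (assuming bash; adjust the file name for your shell):

```shell
# Append the export to ~/.bashrc so every new shell picks it up.
echo 'export SACCT_FORMAT="JobID%20,JobName,User,Partition,NodeList,Elapsed,CPUTime,State,AllocTRES%32"' >> ~/.bashrc
```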
sacct -a -j <job_id> --format=user%10,jobname%10,node%10,start%10,end%10,elapsed%10,MaxRSS
Use the sacct command to access finished Slurm job history.
Here <job_id> refers to the Slurm job ID, and --format= lists the different details to display, and in which format:
user: the user who ran the job
jobname: the job or process name
node: the machine on which the job ran
start and end: the job start and end dates, respectively
elapsed: the runtime of the job or process
MaxRSS: the maximum resident set size, i.e. the peak memory used by the job
%: determines how many characters are dedicated to printing a given piece of info (e.g. jobname%25: the job name will be displayed in 25 characters)