
How to monitor resources during slurm job?

I'm running jobs on our university cluster (regular user, no admin rights), which uses the SLURM scheduling system, and I'm interested in plotting the CPU and memory usage over time, i.e., while the job is running. I know about sacct and sstat, and I was thinking of including these commands in my submission script, e.g. something along the lines of:

#!/bin/bash
#SBATCH <options>

# Running the actual job in background
srun my_program input.in output.out &

# While loop that records resources
JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
FIRST=0
#sleep time in seconds
STIME=15
while [ "$JobStatus" != "COMPLETED" ]; do
    #update job status
    JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
    if [ "$JobStatus" == "RUNNING" ]; then
        if [ $FIRST -eq 0 ]; then
            sstat --format=AveCPU,AveRSS,MaxRSS -P -j ${SLURM_JOB_ID} >> usage.txt
            FIRST=1
        else
            sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j ${SLURM_JOB_ID} >> usage.txt
        fi
        sleep $STIME
    elif [ "$JobStatus" == "PENDING" ]; then
        sleep $STIME
    else
        sacct -j ${SLURM_JOB_ID} --format=AllocCPUS,ReqMem,MaxRSS,AveRSS,AveDiskRead,AveDiskWrite,ReqCPUS,AllocCPUs,NTasks,Elapsed,State >> usage.txt
        JobStatus="COMPLETED"
        break
    fi
done

However, I'm not really convinced of this solution:

  • sstat unfortunately doesn't show how many CPUs are in use at the moment (only the average)

  • MaxRSS is also not helpful if I try to record memory usage over time

  • there still seems to be some error (script doesn't stop after job finishes)

Does anyone have an idea how to do that properly? Maybe even with top or htop instead of sstat? Any help is much appreciated.

asked May 08 '17 17:05 by CoffeeNerd



1 Answer

Slurm offers a plugin to record a profile of a job (CPU usage, memory usage, even disk/network I/O for some technologies) into an HDF5 file. The file contains a time series for each measure tracked, and you can choose the time resolution.

You can activate it with

#SBATCH --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>

See the sbatch documentation for the --profile option.

To check that this plugin is installed, run

scontrol show config | grep AcctGatherProfileType

It should output AcctGatherProfileType = acct_gather_profile/hdf5.

The files are created in the folder referred to by the ProfileHDF5Dir Slurm configuration parameter (set in acct_gather.conf).
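As a rough sketch of how this could look in the submission script, the fragment below turns on task profiling and sets a sampling interval. The 15-second value is only an assumption (chosen to match the sleep time in the question), and whether profiling is permitted depends on how your cluster is configured.

```shell
#!/bin/bash
#SBATCH --profile=task          # record per-task CPU and memory time series
#SBATCH --acctg-freq=task=15    # assumed sampling interval, in seconds

# The rest of the job script (e.g. the srun line from the question) stays unchanged.
```

After the job has finished, the per-node files can usually be merged into a single file with something like sh5util -j <jobid> -o profile.h5 and then inspected with any HDF5 tool; check the sh5util man page on your cluster for the options it supports.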

As for your script, you could try replacing sstat with an SSH connection to the compute nodes to run ps. Assuming pdsh or clush is installed, you could run something like:

pdsh -j $SLURM_JOB_ID ps -u $USER -o pid,state,cputime,%cpu,rssize,command --columns 100 >> usage.txt

This will give you CPU and memory usage per process.
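If you would rather plot one total per sample than one line per process, the ps output can be collapsed with awk. This is only a sketch: summarize_ps_sample is a made-up helper name, and it assumes ps is asked for just the %CPU and RSS columns without headers.

```shell
#!/bin/bash
# Sketch: collapse per-process ps output into one timestamped CSV row.
# summarize_ps_sample is a hypothetical helper name, not a Slurm or ps feature.

# Reads lines of "<%cpu> <rss_kb>" on stdin (as printed by `ps -o pcpu= -o rss=`)
# and prints "<timestamp>,<total %cpu>,<total RSS in kB>".
summarize_ps_sample() {
    local timestamp=$1
    awk -v ts="$timestamp" '
        NF >= 2 { cpu += $1; rss += $2 }   # accumulate over all processes
        END     { printf "%s,%.1f,%d\n", ts, cpu, rss }
    '
}

# Possible use on a compute node (appends one row per call):
#   ps -u "$USER" -o pcpu= -o rss= | summarize_ps_sample "$(date +%s)" >> usage.csv
```

Note that the %CPU value reported by ps is an average over each process's lifetime, not an instantaneous reading, so the total is still only an approximation of the current load.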

As a final note, your job never terminates because of a circular dependency: the job ends only when the while loop ends, and the while loop ends only when the job is reported as finished. The condition "$JobStatus" == "COMPLETED" can therefore never be observed from within the script, because by the time the job is completed, the script has already been killed.
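One way around this, as a minimal sketch, is to stop asking sacct for the job state and instead loop for as long as the background srun process is alive. The function name monitor_until_exit and its argument order are made up for illustration; the sampling command is whatever you pass in.

```shell
#!/bin/bash
# Sketch: sample while a background process is alive, instead of polling sacct.
# monitor_until_exit and its arguments are hypothetical names for illustration.

# Usage: monitor_until_exit PID INTERVAL_SECONDS OUTFILE COMMAND [ARGS...]
monitor_until_exit() {
    local pid=$1 interval=$2 outfile=$3
    shift 3
    # kill -0 sends no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        "$@" >> "$outfile" 2>&1    # append one sample
        sleep "$interval"
    done
}

# In the job script this might be used as follows (same sstat call as in the question):
#   srun my_program input.in output.out &
#   SRUN_PID=$!
#   monitor_until_exit "$SRUN_PID" 15 usage.txt \
#       sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j "$SLURM_JOB_ID"
#   wait "$SRUN_PID"    # pick up the exit status of the real job
```

Because the loop ends as soon as srun exits, the script reaches its last line and the job can finish normally.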

answered Oct 29 '22 13:10 by damienfrancois