How to get the ID of the GPU allocated to a SLURM job on a multi-GPU node?

When I submit a SLURM job with the option --gres=gpu:1 to a node with two GPUs, how can I get the ID of the GPU that is allocated to the job? Is there an environment variable for this purpose? All the GPUs I'm using are NVIDIA GPUs. Thanks.

Asked May 14 '17 by Negelis

People also ask

How do I specify a GPU in Slurm?

To use a GPU in a Slurm job, you need to request it explicitly when submitting the job, using the --gres or --gpus flag. The following flags are available: --gres specifies the number of generic resources required per node; --gpus specifies the number of GPUs required for the entire job.
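For example, a minimal batch script requesting a single GPU could look like the sketch below (the job name, time limit, and the nvidia-smi check are illustrative):

#!/bin/bash
#SBATCH --job-name=gpu-example   # illustrative job name
#SBATCH --gres=gpu:1             # request one GPU on the allocated node
#SBATCH --time=00:10:00          # illustrative time limit

# Runs on the allocated node; nvidia-smi lists the GPUs visible to the job
srun nvidia-smi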

What does Gres GPU mean?

Generic Resources (GRES) are resources associated with a specific node that can be allocated to jobs and steps. The most obvious example of GRES use would be GPUs. GRES are identified by a specific name and use an optional plugin to provide device-specific support.
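As an illustration only, a two-GPU node is typically described to Slurm through gres.conf and slurm.conf; the node name, GPU type tag, and device paths below are assumptions:

# gres.conf on the compute node (hypothetical type and device files)
Name=gpu Type=v100 File=/dev/nvidia[0-1]

# matching slurm.conf entries (hypothetical node name)
GresTypes=gpu
NodeName=node01 Gres=gpu:v100:2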

What is a node in Slurm?

Slurm, using the default node allocation plug-in, allocates nodes to jobs in exclusive mode. This means that even when all the resources within a node are not utilized by a given job, another job will not have access to these resources. Nodes possess resources such as processors, memory, swap, local disk, etc.


3 Answers

You can get the GPU id with the environment variable CUDA_VISIBLE_DEVICES. This variable is a comma separated list of the GPU ids assigned to the job.
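For instance, a job script can simply print the variable, or split it into an array when more than one GPU is requested; this is just a sketch:

#!/bin/bash
#SBATCH --gres=gpu:1
echo "Allocated GPU id(s): $CUDA_VISIBLE_DEVICES"   # e.g. "0" or "1" on a two-GPU node
# Split the comma-separated list into a bash array
IFS=',' read -ra gpu_ids <<< "$CUDA_VISIBLE_DEVICES"
echo "First allocated GPU id: ${gpu_ids[0]}"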

Answered Sep 20 '22 by Carles Fenoy

Slurm stores this information in an environment variable, SLURM_JOB_GPUS.

One way to keep track of such information is to log all SLURM-related variables when running a job, for example (following Kaldi's slurm.pl, which is a great script for wrapping Slurm jobs) by including the following command in the script run by sbatch:

set | grep SLURM | while read line; do echo "# $line"; done
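As a usage sketch, the same one-liner can be dropped into a batch script so the allocated GPU ids can be read back from the job log afterwards (the log file name is illustrative):

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --output=job_%j.log   # illustrative log file; %j expands to the job id
# Record every SLURM-related variable, including SLURM_JOB_GPUS, in the log
set | grep SLURM | while read line; do echo "# $line"; done
# the actual workload would follow here

After the job finishes, grepping the log for SLURM_JOB_GPUS shows which GPU ids were assigned.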
Answered Sep 19 '22 by leilu


You can check the environment variables SLURM_STEP_GPUS or SLURM_JOB_GPUS for a given node:

echo ${SLURM_STEP_GPUS:-$SLURM_JOB_GPUS}

Note that CUDA_VISIBLE_DEVICES may not correspond to the real GPU IDs (see @isarandi's comment).

Also note that this approach should work for non-NVIDIA GPUs as well.
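For a multi-node job, the same check can be run once per node with srun; this command is just a sketch:

srun --ntasks-per-node=1 bash -c 'echo "$(hostname): ${SLURM_STEP_GPUS:-$SLURM_JOB_GPUS}"'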

Answered Sep 18 '22 by bryant1410