We're using Slurm to manage a small on-premises cluster. A key resource we are managing is GPUs. When a user requests GPUs via --gpus=2, the CUDA_VISIBLE_DEVICES environment variable is set to the GPUs Slurm allocates to the job.
$ srun --gpus=2 bash -c 'echo $CUDA_VISIBLE_DEVICES'
0,1
We have a small team and can trust our users not to abuse the system (they could easily overwrite the environment variable), so this works well. However, it's a bit too easy to bypass accidentally: when --gpus isn't specified, $CUDA_VISIBLE_DEVICES is left unset, so the user can use any GPU on the node (we typically use PyTorch).
In other words, the following command works fine (so long as it lands on a GPU node) but I would prefer that it fails (because no GPU was requested).
srun sudo docker run -e CUDA_VISIBLE_DEVICES --runtime=nvidia pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-runtime python -c 'import torch; print(torch.tensor([1., 2.], device=torch.device("cuda:0")))'
It would fail if $CUDA_VISIBLE_DEVICES were set to -1.
$ CUDA_VISIBLE_DEVICES=-1 sudo docker run -e CUDA_VISIBLE_DEVICES --runtime=nvidia pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-runtime python -c 'import torch; print(torch.tensor([1., 2.], device=torch.device("cuda:0")))'
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCGeneral.cpp line=51 error=38 : no CUDA-capable device is detected
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCGeneral.cpp:51
How can I configure SLURM to set CUDA_VISIBLE_DEVICES to -1 when --gpus is not specified?
You can use the TaskProlog script to set the $CUDA_VISIBLE_DEVICES variable to -1 if it was not set by Slurm.
In slurm.conf, set TaskPrologâ=/path/to/prolog.sh and give prolog.sh the following content.
#!/bin/bash
# If Slurm did not allocate any GPUs, CUDA_VISIBLE_DEVICES is unset;
# export -1 so CUDA applications see no devices.
if [[ -z $CUDA_VISIBLE_DEVICES ]]; then
    echo export CUDA_VISIBLE_DEVICES=-1
fi
Lines that the TaskProlog writes to standard output starting with export are injected into the job environment, so the echo export ... line sets CUDA_VISIBLE_DEVICES=-1 for the job.
Make sure /path/to is visible from all compute nodes.
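You can sanity-check the prolog logic by hand, without a cluster, by running it with and without the variable set and observing what Slurm would inject. This is just a local simulation; the /tmp/prolog.sh path is only for the test, not a recommended install location.

```shell
# Write the prolog script to a temporary location (the real path is
# whatever you configured as TaskProlog in slurm.conf).
cat > /tmp/prolog.sh <<'EOF'
#!/bin/bash
if [[ -z $CUDA_VISIBLE_DEVICES ]]; then
    echo export CUDA_VISIBLE_DEVICES=-1
fi
EOF
chmod +x /tmp/prolog.sh

# No GPUs allocated (variable unset): the prolog emits the export
# line that Slurm would inject into the job environment.
env -u CUDA_VISIBLE_DEVICES /tmp/prolog.sh   # prints: export CUDA_VISIBLE_DEVICES=-1

# GPUs allocated (Slurm already set the variable): the prolog emits
# nothing, so the Slurm-assigned value is left untouched.
CUDA_VISIBLE_DEVICES=0,1 /tmp/prolog.sh
```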
But this will not prevent a user from gaming the system and redefining the variable from within their Python script. Truly preventing access would require constraining devices with cgroups.
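For reference, a minimal sketch of the cgroup-based device constraint looks like the following; exact settings depend on your Slurm version and GPU count, and the device paths shown are examples for NVIDIA GPUs.

```
# slurm.conf: track and constrain tasks with cgroups
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf: deny access to device files not allocated to the job
ConstrainDevices=yes

# gres.conf: map each GPU gres to its device file
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
```

With ConstrainDevices=yes, a job that requested no GPUs cannot open the GPU device files at all, so overriding CUDA_VISIBLE_DEVICES no longer helps.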