We're using SLURM to manage a small on-premise cluster. A key resource we are managing is GPUs. When a user requests GPUs via --gpus=2, the CUDA_VISIBLE_DEVICES environment variable is set to the GPUs SLURM allocates to the job.
$ srun --gpus=2 bash -c 'echo $CUDA_VISIBLE_DEVICES'
0,1
We have a small team and can trust our users not to abuse the system (they could easily overwrite the environment variable), so this works great. However, it's a bit too easy to bypass this accidentally: when --gpus isn't specified, $CUDA_VISIBLE_DEVICES is left unset, so the user can use any GPU on the node (we're typically using PyTorch).
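For example, a plain srun with no GPU request shows the variable is simply absent (output illustrative for our setup):

$ srun bash -c 'echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"'
CUDA_VISIBLE_DEVICES=<unset>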
So the following command works fine (so long as it lands on a GPU node), but I would prefer that it fail (because no GPU was requested).
$ srun sudo docker run -e CUDA_VISIBLE_DEVICES --runtime=nvidia pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-runtime python -c 'import torch; print(torch.tensor([1., 2.], device=torch.device("cuda:0")))'
It would fail if $CUDA_VISIBLE_DEVICES were set to -1.
$ CUDA_VISIBLE_DEVICES=-1 sudo docker run -e CUDA_VISIBLE_DEVICES --runtime=nvidia pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-runtime python -c 'import torch; print(torch.tensor([1., 2.], device=torch.device("cuda:0")))'
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCGeneral.cpp line=51 error=38 : no CUDA-capable device is detected
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCGeneral.cpp:51
How can I configure SLURM to set CUDA_VISIBLE_DEVICES to -1 when --gpus is not specified?
You can use the TaskProlog script to set the $CUDA_VISIBLE_DEVICES variable to -1 if it was not set by Slurm.
In slurm.conf, configure TaskProlog=/path/to/prolog.sh.
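The relevant line would look like the following sketch; /path/to/prolog.sh is a placeholder for wherever you install the script:

# slurm.conf (excerpt): run this script before each task, in the job's context
TaskProlog=/path/to/prolog.sh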
Then set the following content for prolog.sh:
#!/bin/bash
# Slurm applies "export NAME=VALUE" lines printed by the TaskProlog
# to the task's environment.
if [[ -z "$CUDA_VISIBLE_DEVICES" ]]; then
    echo export CUDA_VISIBLE_DEVICES=-1
fi
The echo export ... part injects CUDA_VISIBLE_DEVICES=-1 into the job environment: Slurm reads the standard output of the TaskProlog and applies any export NAME=VALUE lines to the task's environment.
Make sure /path/to is visible from all compute nodes.
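Once the prolog is deployed, a quick check from a login node confirms the behavior (output illustrative; it assumes the script above is active and, for the second command, that the job lands on a GPU node):

$ srun bash -c 'echo $CUDA_VISIBLE_DEVICES'
-1
$ srun --gpus=2 bash -c 'echo $CUDA_VISIBLE_DEVICES'
0,1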
But this will not prevent a user from gaming the system and redefining the variable from within the Python script. Really preventing access would require configuring cgroups.
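For reference, hard enforcement would mean enabling Slurm's device cgroup support, e.g. (a sketch, not tested here; it assumes TaskPlugin=task/cgroup in slurm.conf and GPUs declared as GRES):

# cgroup.conf (excerpt): jobs can only open the device files
# of the GPUs actually allocated to them
ConstrainDevices=yes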