We're using Slurm to manage a small on-premises cluster. A key resource we are managing is GPUs. When a user requests GPUs via --gpus=2, the CUDA_VISIBLE_DEVICES environment variable is set to the GPUs Slurm allocates to the job.
$ srun --gpus=2 bash -c 'echo $CUDA_VISIBLE_DEVICES'
0,1
We have a small team and can trust our users not to abuse the system (they could easily overwrite the environment variable), so this works well. However, it's a bit too easy to bypass accidentally: when --gpus isn't specified, $CUDA_VISIBLE_DEVICES is left unset, so the user can use any GPU on the node (we typically use PyTorch).
In other words, the following command works fine (so long as it lands on a GPU node) but I would prefer that it fails (because no GPU was requested).
srun sudo docker run -e CUDA_VISIBLE_DEVICES --runtime=nvidia pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-runtime python -c 'import torch; print(torch.tensor([1., 2.], device=torch.device("cuda:0")))'
It would fail if $CUDA_VISIBLE_DEVICES were set to -1.
$ CUDA_VISIBLE_DEVICES=-1 sudo docker run -e CUDA_VISIBLE_DEVICES --runtime=nvidia pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-runtime python -c 'import torch; print(torch.tensor([1., 2.], device=torch.device("cuda:0")))'
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCGeneral.cpp line=51 error=38 : no CUDA-capable device is detected
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCGeneral.cpp:51
How can I configure SLURM to set CUDA_VISIBLE_DEVICES to -1 when --gpus is not specified?
You can use the TaskProlog script to set the $CUDA_VISIBLE_DEVICES variable to -1 if it was not set by Slurm.
In slurm.conf, set TaskPrologâ=/path/to/prolog.sh and give prolog.sh the following content.
#!/bin/bash
# If Slurm did not allocate any GPUs, CUDA_VISIBLE_DEVICES is unset;
# export -1 so CUDA applications see no devices.
if [[ -z $CUDA_VISIBLE_DEVICES ]]; then
    echo export CUDA_VISIBLE_DEVICES=-1
fi
Lines that the TaskProlog writes to standard output starting with export are injected into the job environment, so the echo export ... line sets CUDA_VISIBLE_DEVICES=-1 for the job.
Make sure /path/to is visible from all compute nodes.
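You can sanity-check the prolog logic by hand, without a cluster, by running it with and without the variable set and observing what Slurm would inject. This is just a local simulation; the /tmp/prolog.sh path is only for the test, not a recommended install location.

```shell
# Write the prolog script to a temporary location (the real path is
# whatever you configured as TaskProlog in slurm.conf).
cat > /tmp/prolog.sh <<'EOF'
#!/bin/bash
if [[ -z $CUDA_VISIBLE_DEVICES ]]; then
    echo export CUDA_VISIBLE_DEVICES=-1
fi
EOF
chmod +x /tmp/prolog.sh

# No GPUs allocated (variable unset): the prolog emits the export
# line that Slurm would inject into the job environment.
env -u CUDA_VISIBLE_DEVICES /tmp/prolog.sh   # prints: export CUDA_VISIBLE_DEVICES=-1

# GPUs allocated (Slurm already set the variable): the prolog emits
# nothing, so the Slurm-assigned value is left untouched.
CUDA_VISIBLE_DEVICES=0,1 /tmp/prolog.sh
```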
But this will not prevent a user from gaming the system and redefining the variable from within their Python script. Truly preventing access would require constraining devices with cgroups.
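For reference, a minimal sketch of the cgroup-based device constraint looks like the following; exact settings depend on your Slurm version and GPU count, and the device paths shown are examples for NVIDIA GPUs.

```
# slurm.conf: track and constrain tasks with cgroups
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf: deny access to device files not allocated to the job
ConstrainDevices=yes

# gres.conf: map each GPU gres to its device file
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
```

With ConstrainDevices=yes, a job that requested no GPUs cannot open the GPU device files at all, so overriding CUDA_VISIBLE_DEVICES no longer helps.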