 

Prevent GPU usage in SLURM when --gpus is not set

Tags: pytorch, slurm

We're using SLURM to manage a small on-premise cluster. A key resource we manage is GPUs. When a user requests GPUs via --gpus=2, the CUDA_VISIBLE_DEVICES environment variable is set to the GPUs SLURM allocates to the user.

$ srun --gpus=2 bash -c 'echo $CUDA_VISIBLE_DEVICES'
0,1

We have a small team and can trust our users not to abuse the system (they could easily overwrite the environment variable), so this works great. However, it's a bit too easy to bypass this accidentally: when --gpus isn't specified, $CUDA_VISIBLE_DEVICES is left unset, so the user can use any GPU (we're typically using PyTorch).

In other words, the following command works fine (so long as it lands on a GPU node) but I would prefer that it fails (because no GPU was requested).

srun sudo docker run -e CUDA_VISIBLE_DEVICES --runtime=nvidia pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-runtime python -c 'import torch; print(torch.tensor([1., 2.], device=torch.device("cuda:0")))'

It would fail if $CUDA_VISIBLE_DEVICES were set to -1.

$ CUDA_VISIBLE_DEVICES=-1 sudo docker run -e CUDA_VISIBLE_DEVICES --runtime=nvidia pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-runtime python -c 'import torch; print(torch.tensor([1., 2.], device=torch.device("cuda:0")))'
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCGeneral.cpp line=51 error=38 : no CUDA-capable device is detected
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCGeneral.cpp:51

How can I configure SLURM to set CUDA_VISIBLE_DEVICES to -1 when --gpus is not specified?

Asked Aug 22 '19 by schmmd

1 Answer

You can use the TaskProlog script to set the $CUDA_VISIBLE_DEVICES variable to -1 if it was not set by Slurm.

In slurm.conf, configure TaskProlog=/path/to/prolog.sh.
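For example, assuming the script is kept at /etc/slurm/prolog.sh (the path here is only an illustration), the slurm.conf entry would read:

TaskProlog=/etc/slurm/prolog.sh

After editing slurm.conf, running scontrol reconfigure (or restarting slurmd, depending on how your site rolls out configuration changes) should make the setting take effect. Then give prolog.sh the following content.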

#!/bin/bash

# Slurm leaves CUDA_VISIBLE_DEVICES unset when no GPUs were requested;
# export the sentinel value -1 so CUDA sees no devices.
if [[ -z "$CUDA_VISIBLE_DEVICES" ]]; then
    echo export CUDA_VISIBLE_DEVICES=-1
fi

The echo export ... line is how a TaskProlog script injects variables: anything it writes to standard output in the form export NAME=value is added to the job's environment, so CUDA_VISIBLE_DEVICES=-1 ends up set for every task that was not allocated GPUs by Slurm.
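If everything is in place, a quick check from a login node should show the sentinel value whenever no GPUs are requested (the output below is the expected behaviour, sketched under the assumption that the prolog is active on the GPU nodes):

$ srun bash -c 'echo $CUDA_VISIBLE_DEVICES'
-1
$ srun --gpus=1 bash -c 'echo $CUDA_VISIBLE_DEVICES'
0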

Make sure /path/to is visible from all compute nodes. Note that this will not prevent a user from gaming the system and redefining the variable from within their Python script; really preventing access would require configuring cgroups.
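For reference, the cgroup-based enforcement mentioned above amounts to switching to the cgroup task plugin and constraining device access. A minimal sketch using the standard Slurm options (it assumes gres.conf already lists the GPU device files for your nodes):

# slurm.conf
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainDevices=yes

With ConstrainDevices=yes, a task can only open the /dev/nvidia* device files for GPUs that were actually allocated to its job, so overriding CUDA_VISIBLE_DEVICES from inside a script no longer grants access to unallocated GPUs.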

Answered Sep 26 '22 by damienfrancois