I have a server (Ubuntu 16.04) with 4 GPUs. My team shares it, and our current approach is to containerize all of our work with Docker and to restrict containers to GPUs using something like $ NV_GPU=0 nvidia-docker run -ti nvidia/cuda nvidia-smi. This works well when we're all very clear about who's using which GPU, but our team has grown and I'd like a more robust way of monitoring GPU use and prohibiting access to GPUs when they're in use. nvidia-smi is one channel of information via its "GPU-Util" column, but a GPU may show 0% GPU-Util at a given moment while it is still reserved by someone working in a container.
Do you have any recommendations for:
$ NV_GPU='gpu_id' nvidia-docker run
$ NV_GPU='same_gpu_id' nvidia-docker run
I may be thinking about this the wrong way too, so open to other ideas. Thanks!
Sounds like a great place to apply CI/CD practices. What you need is a job queue. Each user requests the resources (the GPUs) by triggering the pipeline in some way, e.g. by pushing a commit to a specific branch. An automated system then allocates the shared resources in order, and everybody eventually gets their experiments done.
This is probably the most scalable way to do this, much more so than reservation calendars or ad hoc usage. The only approach that scales better is buying compute from the cloud, but that is outside the scope of OP's question.
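To make the job queue idea concrete, here is a minimal sketch in Python, not a full CI/CD setup. It assumes one worker thread per GPU pulling from a shared FIFO queue, so each GPU runs at most one job at a time. The names Job, GpuJobQueue, build_invocation and run_with_docker are hypothetical and made up for illustration; only the NV_GPU variable and the nvidia-docker run command come from the question.

```python
import os
import queue
import subprocess
import threading
from dataclasses import dataclass


@dataclass
class Job:
    """One requested experiment (hypothetical structure for this sketch)."""
    user: str
    image: str
    command: list


def build_invocation(gpu_id, job):
    """Return (extra_env, argv) that pins the container to a single GPU."""
    extra_env = {"NV_GPU": str(gpu_id)}
    argv = ["nvidia-docker", "run", "--rm", job.image, *job.command]
    return extra_env, argv


def run_with_docker(extra_env, argv):
    """Default runner: launch the container and block until it exits."""
    subprocess.run(argv, env={**os.environ, **extra_env}, check=False)


class GpuJobQueue:
    """FIFO queue served by one worker thread per GPU."""

    def __init__(self, gpu_ids, runner=run_with_docker):
        self._jobs = queue.Queue()
        self._runner = runner
        for gpu_id in gpu_ids:
            worker = threading.Thread(
                target=self._work, args=(gpu_id,), daemon=True
            )
            worker.start()

    def submit(self, job):
        """Enqueue a job; it starts as soon as any GPU is free."""
        self._jobs.put(job)

    def wait(self):
        """Block until every submitted job has finished."""
        self._jobs.join()

    def _work(self, gpu_id):
        # Each worker owns exactly one GPU, so two jobs never share it.
        while True:
            job = self._jobs.get()
            try:
                extra_env, argv = build_invocation(gpu_id, job)
                self._runner(extra_env, argv)
            except Exception as exc:
                print(f"job from {job.user} failed on GPU {gpu_id}: {exc}")
            finally:
                self._jobs.task_done()
```

With this sketch, something like GpuJobQueue([0, 1, 2, 3]) followed by submit(Job("alice", "nvidia/cuda", ["nvidia-smi"])) and wait() would run the job on whichever GPU frees up first. In a real pipeline the submit call would be triggered by the CI system rather than by hand, and the queue would need to live in a long-running service.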