
How to define multiple gres resources in SLURM using the same GPU device?

I'm running machine learning (ML) jobs that make use of very little GPU memory. Thus, I could run multiple ML jobs on a single GPU.

To achieve that, I would like to add multiple lines to the gres.conf file that specify the same device. However, the Slurm daemon doesn't seem to accept this; the service fails with:

fatal: Gres GPU plugin failed to load configuration

Is there any option I'm missing to make this work?

Or maybe a different way to achieve that with SLURM?

This is similar to the question "How to run multiple jobs on a GPU grid with CUDA using SLURM", but that one seems specific to CUDA code with compilation enabled, which is much narrower than my general case (at least as far as I understand).

GDegottex asked Nov 19 '25 20:11



1 Answer

I don't think you can oversubscribe GPUs, so I see two options:

  1. configure the CUDA Multi-Process Service (MPS), which lets several processes share one GPU, or
  2. pack multiple calculations into a single job that has one GPU and run them in parallel.
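
For the second option, here is a minimal sketch of a launcher that starts several processes at once inside one job, so they all share the GPU allocated to that job. The helper names (`run_in_parallel`, `build_env`) and the placeholder commands are made up for illustration, and the fallback to device 0 is an assumption for running outside Slurm; inside a job, Slurm normally sets `CUDA_VISIBLE_DEVICES` for you.

```python
import os
import subprocess
import sys


def build_env(base=None):
    """Return a copy of the environment for the child processes."""
    env = dict(os.environ if base is None else base)
    # Assumption: fall back to device 0 only when nothing is set,
    # e.g. when testing outside of a Slurm allocation.
    env.setdefault("CUDA_VISIBLE_DEVICES", "0")
    return env


def run_in_parallel(commands, env=None):
    """Start all commands at once, wait for them, return exit codes in order."""
    procs = [subprocess.Popen(cmd, env=env) for cmd in commands]
    return [proc.wait() for proc in procs]


if __name__ == "__main__":
    # Hypothetical workloads: replace these with your real training commands.
    commands = [
        [sys.executable, "-c", f"print('run {i} done')"] for i in range(3)
    ]
    codes = run_in_parallel(commands, build_env())
    print("exit codes:", codes)
    # Make the Slurm job fail if any of the packed runs failed.
    if any(code != 0 for code in codes):
        sys.exit(1)
```

You would then submit this script as a single job that requests one GPU (for example with `--gres=gpu:1`). Note that nothing here limits GPU memory per process, so it is up to you to make sure the combined runs fit on the device.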
Marcus Boden answered Nov 23 '25 13:11



