Loading a HuggingFace model on multiple GPUs using model parallelism for inference

I have access to six 24GB GPUs. When I try to load some HuggingFace models, for example the following:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("google/ul2")
model = AutoModelForSeq2SeqLM.from_pretrained("google/ul2")

I get an out-of-memory error, since the model appears to load onto a single GPU only. Although the whole model cannot fit into one 24GB card, I have six of them, and I would like to know whether there is a way to distribute the model across multiple cards in order to perform inference.

HuggingFace seems to have a webpage where they explain how to do this, but it has no useful content as of today.

asked Nov 22 '25 by andrea
1 Answer

When you load the model with from_pretrained(), you need to specify which device(s) you want to load it onto. Add the following argument, and the transformers library will take care of the rest:

model = AutoModelForSeq2SeqLM.from_pretrained("google/ul2", device_map="auto")

Passing "auto" here will automatically split the model across your hardware in the following priority order: GPU(s) > CPU (RAM) > Disk.
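As a minimal sketch of the full workflow (assuming the accelerate package is installed, which device_map="auto" requires, and that the optional torch_dtype=torch.bfloat16 is acceptable for your use case, since it roughly halves the memory footprint), you can load the model, inspect how it was split, and run inference as usual:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/ul2")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/ul2",
    device_map="auto",          # shard the model across the visible GPUs
    torch_dtype=torch.bfloat16, # optional: roughly halves the memory footprint
)

# Shows which device each submodule was placed on, e.g. layers spread over GPUs 0-5.
print(model.hf_device_map)

# Inference works as usual; put the inputs on the first GPU and accelerate's hooks
# move activations between devices as needed.
inputs = tokenizer("Translate to German: Hello, how are you?", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))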

Of course, this answer assumes that CUDA is installed and that your environment can see the available GPUs. Running nvidia-smi from the command line will confirm this. Please report back if you run into further issues.
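For a quick check from Python itself, a short sketch using PyTorch's standard CUDA utilities:

import torch

# True if PyTorch can reach the CUDA runtime at all.
print(torch.cuda.is_available())

# Should report 6 for the setup described in the question.
print(torch.cuda.device_count())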

answered Nov 23 '25 by dcruiz01
