 

Distributed training in Tensorflow using multiple GPUs in Google Colab

I have recently become interested in incorporating distributed training into my TensorFlow projects. I am using Google Colab and Python 3 to implement a neural network with custom distributed training loops, as described in this guide: https://www.tensorflow.org/tutorials/distribute/training_loops

In that guide, under the section 'Create a strategy to distribute the variables and the graph', there is a screenshot of code that sets up a MirroredStrategy and then prints the number of model replicas it created; see the console output below.
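For reference, the code in that section looks roughly like the following (a paraphrase of the linked tutorial; the exact wording of the printed message may differ from the screenshot):

```python
import tensorflow as tf

# MirroredStrategy mirrors the model's variables across all GPUs
# visible to TensorFlow, falling back to the CPU if none are found.
strategy = tf.distribute.MirroredStrategy()

# num_replicas_in_sync is the number of model copies training in lockstep;
# on a standard Colab runtime this is typically 1.
print("Number of devices: {}".format(strategy.num_replicas_in_sync))
```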

Console output

From what I can understand, the output indicates that the MirroredStrategy has created only one replica of the model, and therefore only one GPU will be used for training. My question: is Google Colab limited to training on a single GPU?

I have tried calling MirroredStrategy() both with and without GPU acceleration enabled, but I get one model replica every time. This surprised me because when I use Python's multiprocessing package, I get four threads, so I expected it would be possible to train four models in parallel in Google Colab. Are there issues with TensorFlow's implementation of distributed training?
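The two numbers count different things: multiprocessing reports CPU cores, while MirroredStrategy replicates across GPUs. A minimal sketch to see both counts side by side (the printed values will depend on the runtime you are attached to):

```python
import multiprocessing

import tensorflow as tf

# CPU cores available to Python worker processes -- this is what
# multiprocessing parallelizes over (often 2-4 on a Colab VM).
print("CPU cores:", multiprocessing.cpu_count())

# Accelerators TensorFlow can mirror a model across -- this is what
# MirroredStrategy counts (typically 0 or 1 on Colab).
print("Visible GPUs:", len(tf.config.list_physical_devices("GPU")))
```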

asked Nov 07 '22 by Markus Eriksson


1 Answer

On Google Colab you can only use one GPU; that is a limit imposed by Google. You can run different programs on different GPU instances by creating separate Colab notebooks and connecting each to its own GPU, but you cannot place the same model across many GPU instances in parallel. There is nothing wrong with MirroredStrategy itself: speaking from personal experience, it works fine when you have more than one GPU available.
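On a machine that does have several GPUs, MirroredStrategy picks them all up automatically, or you can name the devices explicitly. A minimal sketch (the explicit device list is only used here when at least two GPUs are actually present, so it also runs unchanged on a single-GPU or CPU-only runtime):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")

if len(gpus) >= 2:
    # Restrict mirroring to a chosen subset of GPUs.
    strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
else:
    # With one GPU (or none), this yields a single replica.
    strategy = tf.distribute.MirroredStrategy()

print("Replicas in sync:", strategy.num_replicas_in_sync)
```

On a multi-GPU machine the replica count equals the number of mirrored devices; on Colab it stays at 1, matching the output in the question.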

answered Nov 15 '22 by Rishabh Sahrawat