I have recently become interested in incorporating distributed training into my TensorFlow projects. I am using Google Colab and Python 3 to implement a neural network with customized, distributed training loops, as described in this guide: https://www.tensorflow.org/tutorials/distribute/training_loops
In that guide, under the section 'Create a strategy to distribute the variables and the graph', there is a screenshot of code that sets up a MirroredStrategy and then prints the number of model replicas it created; see below.
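For reference, the code in that screenshot looks roughly like the following minimal sketch based on the linked tutorial (the exact wording of the print statement may differ):

```python
import tensorflow as tf

# Create a MirroredStrategy, which mirrors the model's variables across
# all GPUs visible to TensorFlow and runs one replica per device.
strategy = tf.distribute.MirroredStrategy()

# Report how many model replicas (i.e. devices) the strategy will use.
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
```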
Console output
From what I understand, the output indicates that the MirroredStrategy has created only one replica of the model, and therefore only one GPU will be used to train it. My question: is Google Colab limited to training on a single GPU?
I have tried calling MirroredStrategy() both with and without GPU acceleration enabled, but I get only one model replica every time. This is a bit surprising, because when I use Python's multiprocessing package I get four threads, so I expected it would be possible to train four models in parallel on Google Colab. Are there issues with TensorFlow's implementation of distributed training?
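For context, one way to check how many GPUs TensorFlow can actually see (and therefore how many replicas MirroredStrategy can create) is roughly the following, assuming TensorFlow 2.x:

```python
import tensorflow as tf
import multiprocessing

# GPUs visible to TensorFlow; on a standard Colab GPU runtime this
# list typically contains a single device.
print(tf.config.list_physical_devices('GPU'))

# CPU core count, which is what the multiprocessing package reports;
# this is unrelated to the number of GPUs MirroredStrategy can use.
print(multiprocessing.cpu_count())
```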
On Google Colab you can only use one GPU; that is Google's limit. You can still run different programs on different GPU instances (for example, by creating separate Colab notebooks, each connected to its own GPU runtime), but you cannot place the same model on many GPU instances in parallel. There is nothing wrong with MirroredStrategy; speaking from personal experience, it works fine when you have more than one GPU.
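To illustrate that last point, on a machine that does expose several GPUs you would expect a sketch like the one below to report more than one replica (the device names and the count of two GPUs are assumptions for illustration, not something Colab provides):

```python
import tensorflow as tf

# On a multi-GPU machine you can list the target devices explicitly;
# MirroredStrategy then creates one model replica per device.
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

# Expected to print 2 on a machine with at least two visible GPUs.
print(strategy.num_replicas_in_sync)
```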