I have been allocated multiple Google Cloud TPUs in the us-central1-f zone. The machine types are all v2-8.
How can I utilize all my TPUs to train a single model?
The us-central1-f zone doesn't support Pods, so using Pods doesn't seem like the solution. Even if Pods were available, the number of v2-8 units that I have does not match any of the Pod TPU slice sizes (16, 64, 128, 256), so I couldn't use them all in a single Pod.
Though I can't find documentation that explicitly answers this question, I have read multiple articles and questions and come to the conclusion that if you are using v2-8 or v3-8 TPUs, it is not possible to use more than one of them for a single training job. You would have to use larger slices such as v2-32 or v3-32 to get access to more cores, and the TFRC program does not provide those for free.
I believe you cannot easily do this. To train a single model across multiple TPUs, you would need access to a zone with TPU Pods. Otherwise, you can do the obvious thing: train the same model on different TPUs with different hyperparameters as a form of grid search, or train multiple weak learners on the separate TPUs and then combine them manually into an ensemble.
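As an illustration of the second option, here is a sketch (the GCS checkpoint paths and SavedModel export are hypothetical) of combining models trained independently on separate v2-8 nodes by averaging their predicted class probabilities:

```python
import tensorflow as tf

# Hypothetical checkpoints: each was trained on its own v2-8, e.g. with a
# different learning rate or data shard, and exported as a SavedModel.
CHECKPOINTS = [
    "gs://my-bucket/run-lr-1e-3/saved_model",
    "gs://my-bucket/run-lr-3e-4/saved_model",
    "gs://my-bucket/run-lr-1e-4/saved_model",
]

def ensemble_predict(x):
    """Average the predicted probabilities of the independently trained models."""
    models = [tf.keras.models.load_model(path) for path in CHECKPOINTS]
    probs = [tf.nn.softmax(m(x, training=False), axis=-1) for m in models]
    return tf.reduce_mean(tf.stack(probs, axis=0), axis=0)
```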