Check TPU workload/utilization

I am training a model, and when I open the TPU in the Google Cloud Platform console, it shows me the CPU utilization (on the TPU, I suppose). It is really, really low (around 0.07%), so maybe it is the VM's CPU? I am wondering whether the training is actually running properly or whether the TPUs are just that powerful.

Is there any other way to check the TPU usage? Maybe with a ctpu command?

asked Sep 20 '18 14:09

craft

2 Answers

I would recommend using the TPU profiling tools that plug into TensorBoard. A good tutorial for installing and using these tools can be found here.

You run the profiler while your TPU is training. It adds an extra tab to your TensorBoard with TPU-specific profiling information. Among the most useful metrics:

Average step time
Host idle time (how much time the CPU spends idling)
TPU idle time
Utilization of TPU Matrix units

Based on these metrics, the profiler will suggest ways to start optimizing your model to train well on a TPU. You can also dig into the more sophisticated profiling tools, such as the trace viewer or the list of the most expensive graph operations.
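As a rough illustration of capturing a profile while training is running, here is a minimal Python sketch. It assumes TensorFlow 2.x and its `tf.profiler.experimental.client.trace` call, which is not necessarily the tool the linked tutorial describes. The helper names `profiler_address` and `capture_profile`, the default port 8466, and the example IP and bucket path are assumptions for illustration only.

```python
def profiler_address(tpu_host: str, port: int = 8466) -> str:
    """Build the gRPC address of the TPU profiler service.

    Port 8466 is assumed to be the profiler port; adjust it if your setup differs.
    """
    return f"grpc://{tpu_host}:{port}"


def capture_profile(tpu_host: str, logdir: str, duration_ms: int = 2000) -> None:
    """Ask a TPU that is currently training for a short trace.

    The trace is written under `logdir`, where TensorBoard can pick it up.
    """
    # Imported lazily so that profiler_address() works without TensorFlow installed.
    import tensorflow as tf

    tf.profiler.experimental.client.trace(
        service_addr=profiler_address(tpu_host),
        logdir=logdir,
        duration_ms=duration_ms,
    )


if __name__ == "__main__":
    # Hypothetical values: replace with your TPU's internal IP and your own log directory.
    print(profiler_address("10.0.0.2"))
    # capture_profile("10.0.0.2", "gs://my-bucket/model-dir")
```

After a capture, pointing TensorBoard at the same log directory should show the profiling tab with the step time, idle time, and matrix unit utilization figures listed above.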

For some guidelines on performance tuning (in addition to those ch_mike already linked), you can look at the TPU performance guide.

answered Sep 19 '22 07:09

Auberon López

If you are looking at GCP -> Compute Engine -> TPU, you are looking at the correct spot. If you look at the monitoring graphs of your associated Compute Engine instance, you'll see that its CPU graph is different.

Currently, there doesn't seem to be any other way to get that information, since none of these options provide it:

gcloud compute tpus describe <tpu-name> --zone=<zone>

ctpu status --details

Nor does the TPU API.
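To check for yourself what the describe command returns, a small Python wrapper along these lines may help. It is only a sketch: it assumes the `gcloud` CLI is installed and authenticated, it relies on the general `--format=json` output flag, and the function names, TPU name, and zone are hypothetical.

```python
import json
import subprocess
from typing import List


def describe_command(tpu_name: str, zone: str) -> List[str]:
    """Build the `gcloud compute tpus describe` command as an argument list."""
    return [
        "gcloud", "compute", "tpus", "describe", tpu_name,
        f"--zone={zone}",
        "--format=json",  # ask gcloud for machine-readable output
    ]


def describe_tpu(tpu_name: str, zone: str) -> dict:
    """Run the describe command and parse its JSON output."""
    result = subprocess.run(
        describe_command(tpu_name, zone),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if gcloud exits with an error
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Hypothetical TPU name and zone: replace with your own.
    info = describe_tpu("my-tpu", "us-central1-b")
    # Listing the top-level keys shows which fields the command actually reports.
    print(sorted(info.keys()))
```

Printing the keys makes it easy to confirm that the output describes the TPU node itself rather than its utilization.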

As for whether your training is running properly, it is hard to say. You can refer to Using TPU and make sure you are following the guidelines there. Another useful resource is Improving training speed.

answered Sep 22 '22 07:09

ch_mike

Tags:

tensorflow

google-compute-engine

google-cloud-platform

google-cloud-tpu
