 

Check TPU workload/utilization

I am training a model, and when I open the TPU in the Google Cloud Platform console, it shows me the CPU utilization (of the TPU, I suppose). It is really, really low (around 0.07%), so maybe it is the VM's CPU? I am wondering whether the training is actually running properly or whether the TPUs are just that powerful.

Is there any other way to check the TPU usage? Maybe with a ctpu command?

asked Sep 20 '18 by craft

People also ask

How do I check my TPU?

Before you run this Colab notebook, make sure that your hardware accelerator is a TPU by checking your notebook settings: Runtime > Change runtime type > Hardware accelerator > TPU.
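You can also verify the TPU from code inside the notebook. The sketch below is an assumption-laden helper (it presumes TensorFlow 2.x with TPU support is installed, as in a Colab TPU runtime); it returns the TPU logical devices, or an empty list when no TPU is reachable:

```python
def tpu_devices():
    """Return the visible TPU logical devices, or [] when no TPU is attached."""
    try:
        import tensorflow as tf  # assumed: TF 2.x with TPU support
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver()  # auto-detects the Colab TPU
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        return tf.config.list_logical_devices("TPU")
    except Exception:  # no TensorFlow here, or no TPU reachable from this runtime
        return []

print(tpu_devices())  # a v2/v3 Cloud TPU exposes 8 logical cores; [] otherwise
```

An empty list tells you the runtime is not actually attached to a TPU, which is worth ruling out before digging into utilization numbers.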

How many cores does a TPU have?

A single Cloud TPU chip contains 2 cores, each of which contains multiple matrix units (MXUs) designed to accelerate programs dominated by dense matrix multiplications and convolutions (see System Architecture).

Is Google TPU free?

Use Cloud TPUs for free, right in your browser. If you'd like to get started with Cloud TPUs right away, you can access them for free in your browser using Google Colab.


2 Answers

I would recommend using the TPU profiling tools that plug into TensorBoard. A good tutorial covering installation and use of these tools can be found here.

You'll run the profiler while your TPU is training. It will add an extra tab to your TensorBoard with TPU-specific profiling information. Among the most useful:

  • Average step time
  • Host idle time (how much time the CPU spends idling)
  • TPU idle time
  • Utilization of TPU Matrix units

Based on these metrics, the profiler will suggest ways to start optimizing your model to train well on a TPU. You can also dig into the more sophisticated profiling tools like a trace viewer, or a list of the most expensive graph operations.
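If you are on TensorFlow 2.x, a programmatic alternative to the standalone capture tool is the `tf.profiler.experimental` API. This is a minimal sketch under stated assumptions (TF 2.x with the TensorBoard profile plugin installed); it profiles a few dummy steps and returns the log directory, falling back gracefully when TensorFlow is absent:

```python
import tempfile

def capture_short_profile(num_steps=5):
    """Profile a few dummy steps; returns the log dir to point TensorBoard at,
    or None when TensorFlow is not installed."""
    try:
        import tensorflow as tf  # assumed: TF 2.x
    except ImportError:
        return None
    logdir = tempfile.mkdtemp()  # in practice, use a GCS bucket the TPU job can write to
    tf.profiler.experimental.start(logdir)
    for _ in range(num_steps):
        x = tf.random.uniform((256, 256))
        tf.matmul(x, x)  # stand-in for one real training step
    tf.profiler.experimental.stop()
    return logdir
```

Afterwards, run `tensorboard --logdir <logdir>` and open the Profile tab to see step time, host/TPU idle time, and MXU utilization.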

For some guidelines on performance tuning (in addition to those ch_mike already linked) you can look at the TPU performance guide.

answered Sep 19 '22 by Auberon López


If you are looking at GCP -> Compute Engine -> TPU, you are looking at the correct spot. If you look at the monitoring graphs of the associated Compute Engine instance, you'll see that its CPU graph is different.

Currently, there doesn't seem to be any other way to get that information, since none of these options provides it:

gcloud compute tpus describe <tpu-name> --zone=<zone>

ctpu status --details

Nor does the TPU API.

As to whether your training is proper or not, that is hard to say; you can refer to Using TPU and make sure you are following the guidelines there. Another useful resource is Improving training speed.

answered Sep 22 '22 by ch_mike