How to make Google Cloud AI Platform detect `tf.summary.scalar` calls during training?

I have been trying to get Google Cloud's AI Platform to display the accuracy of a Keras model trained on AI Platform. I configured hyperparameter tuning with hptuning_config.yaml and it works. However, I can't get AI Platform to pick up tf.summary.scalar calls during training.


I have been following these documentation pages:

1. Overview of hyperparameter tuning

2. Using hyperparameter tuning

According to [1]:

"How AI Platform Training gets your metric

You may notice that there are no instructions in this documentation for passing your hyperparameter metric to the AI Platform Training training service. That's because the service monitors TensorFlow summary events generated by your training application and retrieves the metric."

And according to [2], one way of generating such a TensorFlow summary event is to create a callback class like so:

class MyMetricCallback(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs=None):
        tf.summary.scalar('metric1', logs['RootMeanSquaredError'], epoch)

My code

So in my code I included:

# hptuning_config.yaml

trainingInput:
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 4
    maxParallelTrials: 2
    hyperparameterMetricTag: val_accuracy
    params:
    - parameterName: learning_rate
      type: DOUBLE
      minValue: 0.001
      maxValue: 0.01
      scaleType: UNIT_LOG_SCALE

# model.py

class MetricCallback(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs):
        tf.summary.scalar('val_accuracy', logs['val_accuracy'], epoch)

I even tried

# model.py

class MetricCallback(tf.keras.callbacks.Callback):
    def __init__(self, logdir):
        super().__init__()
        self.writer = tf.summary.create_file_writer(logdir)

    def on_epoch_end(self, epoch, logs):
        # Write the scalar through the file writer created in __init__.
        with self.writer.as_default():
            tf.summary.scalar('val_accuracy', logs['val_accuracy'], epoch)

This successfully saved the 'val_accuracy' metric to Google Cloud Storage (I can also see it in TensorBoard), but AI Platform still does not pick it up, despite the claim made in [1].

Partial solution:

Using the Cloud ML Hypertune package, I created the following class:

# model.py

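class MetricCallback(tf.keras.callbacks.Callback):
    def __init__(self):
        super().__init__()
        self.hpt = hypertune.HyperTune()

    def on_epoch_end(self, epoch, logs):
        # Report this epoch's validation accuracy under the tuning metric tag.
        self.hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag='val_accuracy',
            metric_value=logs['val_accuracy'],
            global_step=epoch)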

This works! But I don't see how, since all it seems to do is write to a file on the AI Platform worker at /tmp/hypertune/*. Nothing in the Google Cloud documentation explains how AI Platform picks this up.
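One way to see what ends up in that directory is to dump its contents from inside the training job. The helper below is a minimal sketch using only the standard library. The name `read_metric_files` is hypothetical, the default directory simply mirrors the /tmp/hypertune path mentioned above, and nothing is assumed about the file names or their format.

```python
import glob
import os


def read_metric_files(directory="/tmp/hypertune"):
    """Return {path: text} for every regular file directly under `directory`.

    Hypothetical debugging helper: it only reads files and assumes
    nothing about their names or contents.
    """
    contents = {}
    for path in sorted(glob.glob(os.path.join(directory, "*"))):
        if os.path.isfile(path):
            with open(path, "r", errors="replace") as handle:
                contents[path] = handle.read()
    return contents


# Example: call at the end of training so the contents appear in the job logs.
# for path, text in read_metric_files().items():
#     print(path, text, sep="\n")
```

If the directory does not exist (for example when running locally), the function returns an empty dict rather than raising.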

Am I missing something in order to get tf.summary.scalar events to be displayed?

1 Answer

I have the same issue: I can't get AI Platform to pick up tf.summary.scalar. I have been debugging it with GCP support and the AI Platform Engineering team for the last 2 months. They could not reproduce the issue even though we were using almost the same code. We even did a joint coding session and still got different results.

The recommendation from the GCP AI Platform Engineering team is "don't use tf.summary.scalar". The main reasons for using the other method instead are:

  • it works fine for everybody
  • you can control and see what happens (not a black box)

They will update the documentation to reflect this new recommendation.


My setup:

  • TensorFlow 2.2.0
  • TensorBoard 2.2.2
  • Keras model is created within the tf.distribute.MirroredStrategy() scope
  • Keras callback for TensorBoard

With the following setup the "issue" is observed:

  • when using TensorBoard with update_freq='epoch' and with 1 epoch only

It seems to work with other setups. Anyway, I will follow the recommendation from GCP and use the custom solution to avoid the issue.
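For reference, reporting the metric explicitly can be sketched roughly as follows, assuming the custom solution refers to the hypertune-based approach from the question. This is only an illustration: the names `EpochMetricReporter` and `PrintReporter` are hypothetical, and the only external call relied on is `report_hyperparameter_tuning_metric` from the `cloudml-hypertune` package. The reporter is passed in rather than created inside the class, so the logic can be tried locally without TensorFlow or AI Platform.

```python
class EpochMetricReporter:
    """Forwards one metric from the Keras `logs` dict to a reporter object.

    Hypothetical helper. `reporter` is expected to expose
    report_hyperparameter_tuning_metric(), e.g. hypertune.HyperTune().
    """

    def __init__(self, reporter, metric_name="val_accuracy"):
        self.reporter = reporter
        self.metric_name = metric_name

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        if self.metric_name not in logs:
            # Nothing to report, e.g. when fit() was given no validation data.
            return False
        self.reporter.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag=self.metric_name,
            metric_value=float(logs[self.metric_name]),
            global_step=epoch,
        )
        return True


class PrintReporter:
    """Local stand-in for hypertune.HyperTune() that records and prints calls."""

    def __init__(self):
        self.calls = []

    def report_hyperparameter_tuning_metric(
        self, hyperparameter_metric_tag, metric_value, global_step
    ):
        self.calls.append((hyperparameter_metric_tag, metric_value, global_step))
        print(hyperparameter_metric_tag, metric_value, global_step)


# Local dry run with the stand-in reporter.
local = EpochMetricReporter(PrintReporter())
local.on_epoch_end(0, {"val_accuracy": 0.5})
```

On AI Platform, the same object could be hooked into Keras with `tf.keras.callbacks.LambdaCallback(on_epoch_end=EpochMetricReporter(hypertune.HyperTune()).on_epoch_end)` and passed to `model.fit(..., callbacks=[...])`. The metric name must match `hyperparameterMetricTag` in hptuning_config.yaml.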

