 

How to make Google Cloud AI Platform detect `tf.summary.scalar` calls during training?

(Note: I have also asked this question here)

Problem

I have been trying to get Google Cloud's AI Platform to display the accuracy of a Keras model trained on the AI Platform. I configured hyperparameter tuning with hptuning_config.yaml and it works. However, I can't get AI Platform to pick up `tf.summary.scalar` calls during training.

Documentation

I have been following the following documentation pages:

1. Overview of hyperparameter tuning

2. Using hyperparameter tuning

According to [1]:

How AI Platform Training gets your metric: "You may notice that there are no instructions in this documentation for passing your hyperparameter metric to the AI Platform Training training service. That's because the service monitors TensorFlow summary events generated by your training application and retrieves the metric."

And according to [2], one way of generating such a TensorFlow summary event is by creating a callback class like so:

class MyMetricCallback(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs=None):
        tf.summary.scalar('metric1', logs['RootMeanSquaredError'], epoch)

My code

So in my code I included:

# hptuning_config.yaml

trainingInput:
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 4
    maxParallelTrials: 2
    hyperparameterMetricTag: val_accuracy
    params:
    - parameterName: learning_rate
      type: DOUBLE
      minValue: 0.001
      maxValue: 0.01
      scaleType: UNIT_LOG_SCALE
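
(For context, AI Platform passes each tuned parameterName to the trainer as a command-line flag, e.g. --learning_rate. The sketch below shows roughly how the trainer parses it; the --job-dir flag and default values are assumptions for illustration, not my exact code.)

# trainer entry point (sketch)
import argparse

def get_args():
    parser = argparse.ArgumentParser()
    # supplied by the hyperparameter tuning service for each trial
    parser.add_argument('--learning_rate', type=float, default=0.001)
    # forwarded when the job is submitted with --job-dir (assumption)
    parser.add_argument('--job-dir', dest='job_dir', default='/tmp')
    return parser.parse_args()

# args = get_args(); the value is then used e.g. as tf.keras.optimizers.Adam(args.learning_rate)
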
# model.py

class MetricCallback(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs):
        tf.summary.scalar('val_accuracy', logs['val_accuracy'], epoch)

I even tried creating an explicit summary writer (since in TF 2.x, tf.summary.scalar only writes when a default writer is active):

# model.py

class MetricCallback(tf.keras.callbacks.Callback):
    def __init__(self, logdir):
        super().__init__()
        # explicit file writer so tf.summary.scalar has a default writer to write to
        self.writer = tf.summary.create_file_writer(logdir)

    def on_epoch_end(self, epoch, logs=None):
        with self.writer.as_default():
            tf.summary.scalar('val_accuracy', logs['val_accuracy'], step=epoch)

This successfully saved the 'val_accuracy' metric to Google Cloud Storage (I can also see it in TensorBoard). But it does not get picked up by AI Platform, despite the claim made in [1].

Partial solution:

Using the Cloud ML Hypertune package, I created the following class:

# model.py

import hypertune  # from the cloudml-hypertune package

class MetricCallback(tf.keras.callbacks.Callback):
    def __init__(self):
        super().__init__()
        self.hpt = hypertune.HyperTune()

    def on_epoch_end(self, epoch, logs):
        self.hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag='val_accuracy',
            metric_value=logs['val_accuracy'],
            global_step=epoch
        )

This works! But I don't see how, since all it seems to do is write to a file on the AI Platform worker at /tmp/hypertune/*. There is nothing in the Google Cloud documentation that explains how this gets picked up by AI Platform...
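
For reference, a sketch of how this callback might be wired into model.fit (the model and data below are placeholder assumptions, not my actual code; the package is installed with pip install cloudml-hypertune):

# sketch: using the MetricCallback above during training
import numpy as np
import tensorflow as tf

# placeholder data so the sketch is self-contained
x = np.random.rand(256, 10).astype('float32')
y = np.random.randint(0, 2, size=(256,)).astype('float32')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# a validation split is needed so that logs contains 'val_accuracy'
model.fit(x, y, epochs=4, validation_split=0.2, callbacks=[MetricCallback()])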

Am I missing something in order to get tf.summary.scalar events to be displayed?

asked Apr 28 '20 by Julian Ferry


1 Answer

I am having the same issue: I can't get AI Platform to pick up tf.summary.scalar. I tried to debug it with GCP support and the AI Platform engineering team for the last two months. They didn't manage to reproduce the issue even though we were using almost the same code. We even did one coding session but still got different results.

The recommendation from the GCP AI Platform engineering team is: "don't use tf.summary.scalar". The main reasons for using the other method instead are:

  • it works fine for everybody
  • you can control and see what happens (it is not a black box)

They will update the documentation to reflect this new recommendation.

Setup:

  • TensorFlow 2.2.0
  • TensorBoard 2.2.2
  • Keras model created within the tf.distribute.MirroredStrategy() scope
  • Keras callback for TensorBoard

With that setup, the "issue" is observed in the following case (a minimal sketch of this combination follows below):

  • when using TensorBoard with update_freq='epoch' and only 1 epoch
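
A minimal sketch of that setup (the model, data, and bucket path are placeholder assumptions, not the original code):

import numpy as np
import tensorflow as tf

# placeholder data
x = np.random.rand(256, 10).astype('float32')
y = np.random.randint(0, 2, size=(256,)).astype('float32')

# Keras model created within the MirroredStrategy scope
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# TensorBoard callback writing epoch-level summaries; combined with epochs=1 this is
# the case where the metric was not picked up
tensorboard_cb = tf.keras.callbacks.TensorBoard(
    log_dir='gs://my-bucket/logs',  # assumed bucket path
    update_freq='epoch')

model.fit(x, y, epochs=1, validation_split=0.2, callbacks=[tensorboard_cb])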

It seems to work with other setups. Anyway, I will follow the recommendation from GCP and use the custom solution to avoid the issue.


answered Oct 12 '22 by Dr. Fabien Tarrade