(Note: I have also asked this question here)
I have been trying to get Google Cloud's AI Platform to display the accuracy of a Keras model trained on the AI Platform. I configured hyperparameter tuning with hptuning_config.yaml and it works. However, I can't get the AI Platform to pick up tf.summary.scalar calls during training.
I have been following these documentation pages:
1. Overview of hyperparameter tuning
2. Using hyperparameter tuning
According to [1]:
"How AI Platform Training gets your metric: You may notice that there are no instructions in this documentation for passing your hyperparameter metric to the AI Platform Training training service. That's because the service monitors TensorFlow summary events generated by your training application and retrieves the metric."
And according to [2], one way of generating such a TensorFlow summary event is by creating a callback class like so:
```python
class MyMetricCallback(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs=None):
        tf.summary.scalar('metric1', logs['RootMeanSquaredError'], epoch)
```
So in my code I included:
```yaml
# hptuning_config.yaml
trainingInput:
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 4
    maxParallelTrials: 2
    hyperparameterMetricTag: val_accuracy
    params:
      - parameterName: learning_rate
        type: DOUBLE
        minValue: 0.001
        maxValue: 0.01
        scaleType: UNIT_LOG_SCALE
```
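In case it matters, I submit the job along these lines (the job name, module/package paths, region, and runtime versions below are placeholders for my actual values):

```shell
# Hypothetical submission command; names, paths, and region are placeholders.
gcloud ai-platform jobs submit training my_hp_tuning_job \
  --config hptuning_config.yaml \
  --module-name trainer.model \
  --package-path ./trainer \
  --region us-central1 \
  --runtime-version 2.1 \
  --python-version 3.7
```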
```python
# model.py
class MetricCallback(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs=None):
        tf.summary.scalar('val_accuracy', logs['val_accuracy'], epoch)
```
I even tried the following, which writes through an explicit summary writer:
```python
# model.py
class MetricCallback(tf.keras.callbacks.Callback):

    def __init__(self, logdir):
        super().__init__()
        self.writer = tf.summary.create_file_writer(logdir)

    def on_epoch_end(self, epoch, logs=None):
        with self.writer.as_default():
            tf.summary.scalar('val_accuracy', logs['val_accuracy'], epoch)
```
This successfully saved the 'val_accuracy' metric to Google Cloud Storage (I can also see it with TensorBoard). But it does not get picked up by the AI Platform, despite the claim made in [1].
Using the Cloud ML Hypertune package, I created the following class:
```python
# model.py
import hypertune

class MetricCallback(tf.keras.callbacks.Callback):

    def __init__(self):
        super().__init__()
        self.hpt = hypertune.HyperTune()

    def on_epoch_end(self, epoch, logs=None):
        self.hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag='val_accuracy',
            metric_value=logs['val_accuracy'],
            global_step=epoch
        )
```
which works! But I don't see how, since all it seems to do is write to a file on the AI Platform worker at /tmp/hypertune/*. There is nothing in the Google Cloud documentation that explains how this gets picked up by the AI Platform...
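From poking around on the worker, report_hyperparameter_tuning_metric appears to just append one JSON record per report to a metrics file under /tmp/hypertune, which the service presumably polls. The file name and field names below are my guesswork from inspecting the worker, not documented behavior; a minimal stdlib-only sketch of what it seems to do:

```python
import json
import os
import time

def report_metric_sketch(tag, value, step, base_dir='/tmp/hypertune'):
    """Guesswork sketch of what hypertune.HyperTune.report_hyperparameter_tuning_metric
    appears to do: append one JSON record per report to a metrics file that the
    AI Platform service presumably polls. The file name ('output.metrics') and
    the record's field names are assumptions, not a documented API."""
    os.makedirs(base_dir, exist_ok=True)
    record = {
        'timestamp': time.time(),  # when the metric was reported
        'global_step': step,       # e.g. the epoch number
        tag: value,                # the metric itself, keyed by its tag
    }
    # Append, so each epoch's report becomes its own line in the file.
    with open(os.path.join(base_dir, 'output.metrics'), 'a') as f:
        f.write(json.dumps(record) + '\n')
```

In the actual training code, the callback from above is simply passed to model.fit(..., callbacks=[MetricCallback()]).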
Am I missing something in order to get tf.summary.scalar events picked up and displayed?
I am having the same issue: I can't get AI Platform to pick up tf.summary.scalar. I tried to debug it with GCP support and the AI Platform engineering team for the last 2 months. They didn't manage to reproduce the issue even though we were using almost the same code. We even did one coding session but still got different results.
Recommendation from the GCP AI Platform engineering team: "don't use tf.summary.scalar"; use the other method (the cloudml-hypertune package) instead. They will update the documentation to reflect this new recommendation.
Setup: the "issue" is observed with the following setup:
It seems to work with other setups. Anyway, I will follow the recommendation from GCP and use the custom solution to avoid the issue.