Watching over SageMaker while it is training

Question

I am using Amazon SageMaker to train a model with a lot of data. This takes a lot of time - hours or even days. During this time, I would like to be able to query the trainer and see its current status, particularly:

How many iterations has it already completed, and how many does it still need to do? (The training algorithm is deep learning, so it is based on iterations.)
How much time does it need to complete the training?
Ideally, I would like to classify a test-sample using the model of the current iteration, to see its current performance.

One way to do this is to explicitly tell the trainer to print debug messages after each iteration. However, these messages will be available only at the console from which I run the trainer. Since training takes so much time, I would like to be able to query the trainer status remotely, from different computers.

Is there a way to remotely query the status of a running trainer?

Gili Nachum · Accepted Answer

All logs are available in Amazon CloudWatch. You can query CloudWatch programmatically through its API and parse the logs.
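As a rough sketch of what querying the logs could look like from any machine with AWS credentials: SageMaker writes training output to the /aws/sagemaker/TrainingJobs log group, with log streams whose names start with the job name. The helper names below (fetch_log_messages, latest_progress) are made up for illustration, and the "Epoch 3/10" line format is purely an assumption about what your own training script prints, so adjust the pattern to your actual messages.

```python
import re

# Log group where SageMaker writes training job output.
LOG_GROUP = "/aws/sagemaker/TrainingJobs"

# ASSUMPTION: the training script prints lines such as "Epoch 3/10".
# Change this pattern to match whatever your script really logs.
PROGRESS_PATTERN = re.compile(r"Epoch\s+(\d+)\s*/\s*(\d+)")


def fetch_log_messages(logs_client, job_name, start_time_ms=0):
    """Return all log messages for one training job, oldest first.

    `logs_client` is expected to be a boto3 CloudWatch Logs client,
    e.g. boto3.client("logs").
    """
    messages = []
    kwargs = {
        "logGroupName": LOG_GROUP,
        "logStreamNamePrefix": job_name,  # streams are named "<job name>/..."
        "startTime": start_time_ms,       # milliseconds since the epoch
    }
    while True:
        response = logs_client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in response.get("events", []))
        token = response.get("nextToken")
        if not token:
            return messages
        kwargs["nextToken"] = token  # ask for the next page of results


def latest_progress(messages):
    """Return (done, total) from the most recent progress line, or None."""
    for message in reversed(messages):
        match = PROGRESS_PATTERN.search(message)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


# Example use (needs AWS credentials and a real job name):
#   import boto3
#   messages = fetch_log_messages(boto3.client("logs"), "my-training-job")
#   print(latest_progress(messages))
```

For a long job, passing a recent start_time_ms keeps the number of pages small, since otherwise every log line since the job started is downloaded on each query.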

Are you using a built-in algorithm or a framework like MXNet or TensorFlow? For TensorFlow, you can monitor your job with TensorBoard.

Additionally, you can see the high-level job status using the describe training job API call:

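import sagemaker
# Reuse the boto3 SageMaker client held by the SageMaker session.
sm_client = sagemaker.Session().sagemaker_client
# Replace the placeholder with the name of your training job.
print(sm_client.describe_training_job(TrainingJobName='Your job name here'))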
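The raw response is a large dictionary, so it may help to pull out only the fields that describe progress. The sketch below assumes the response shape returned by describe_training_job (TrainingJobStatus, SecondaryStatus, TrainingStartTime, SecondaryStatusTransitions, FinalMetricDataList); summarize_training_job is a hypothetical helper name, not part of any SDK. As far as I know the response does not report iteration counts or remaining time, and metrics appear only if the job defines them.

```python
from datetime import datetime, timezone


def summarize_training_job(description, now=None):
    """Condense a describe_training_job response into a few fields."""
    now = now or datetime.now(timezone.utc)
    start = description.get("TrainingStartTime")
    # A job that is still running has no end time yet, so measure up to now.
    end = description.get("TrainingEndTime") or now
    elapsed = (end - start).total_seconds() if start else None

    transitions = description.get("SecondaryStatusTransitions", [])
    last_message = transitions[-1].get("StatusMessage") if transitions else None

    # Present only when the job emits metrics (may be missing or empty).
    metrics = {
        metric["MetricName"]: metric["Value"]
        for metric in description.get("FinalMetricDataList", [])
    }
    return {
        "status": description["TrainingJobStatus"],
        "secondary_status": description.get("SecondaryStatus"),
        "elapsed_seconds": elapsed,
        "last_message": last_message,
        "metrics": metrics,
    }


# Example use (needs AWS credentials and a real job name):
#   import boto3
#   sm_client = boto3.client("sagemaker")
#   response = sm_client.describe_training_job(TrainingJobName="my-training-job")
#   print(summarize_training_job(response))
```

Since this is an ordinary API call, it can be run from any computer with credentials for the account, which covers the remote-query part of the question.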


Tags: machine-learning, amazon-sagemaker

Erel Segal-Halevi
