
Watching over SageMaker while it is training

I am using Amazon SageMaker to train a model with a lot of data. This takes a lot of time: hours or even days. During this time, I would like to be able to query the trainer and see its current status, particularly:

  • How many iterations has it already completed, and how many does it still need to do? (The training algorithm is deep learning, so it is based on iterations.)
  • How much time does it need to complete the training?
  • Ideally, I would like to classify a test-sample using the model of the current iteration, to see its current performance.

One way to do this is to explicitly tell the trainer to print debug messages after each iteration. However, these messages will be available only at the console from which I run the trainer. Since training takes so much time, I would like to be able to query the trainer status remotely, from different computers.

Is there a way to remotely query the status of a running trainer?

Erel Segal-Halevi asked Mar 05 '26 05:03


1 Answer

All logs from the training job are available in Amazon CloudWatch, so anything the trainer prints can be read remotely. You can query CloudWatch programmatically through its API and parse the logs.
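
As a rough sketch of what querying the logs could look like: the helper below pages through CloudWatch Logs with the boto3 `filter_log_events` call, assuming the job writes to the `/aws/sagemaker/TrainingJobs` log group with stream names that start with the job name. The function names and the `epoch N/M` progress format are hypothetical; the pattern has to match whatever your own training script prints.

```python
import re

# Log group that SageMaker training jobs write to (assumed default).
LOG_GROUP = "/aws/sagemaker/TrainingJobs"

# Hypothetical progress format such as "Epoch 3/10"; adapt to your script's output.
EPOCH_PATTERN = re.compile(r"epoch (\d+)/(\d+)", re.IGNORECASE)


def fetch_training_log_lines(logs_client, job_name, start_time_ms=0):
    """Return the log messages of one training job, in the order CloudWatch returns them.

    logs_client is a CloudWatch Logs client, e.g. boto3.client("logs").
    """
    kwargs = {
        "logGroupName": LOG_GROUP,
        "logStreamNamePrefix": job_name,  # streams are named "<job name>/..."
        "startTime": start_time_ms,       # milliseconds since the Unix epoch
    }
    lines = []
    while True:
        page = logs_client.filter_log_events(**kwargs)
        lines.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return lines
        kwargs["nextToken"] = token  # ask for the next page of results


def latest_progress(lines):
    """Return (current, total) from the most recent matching line, or None."""
    for line in reversed(lines):
        match = EPOCH_PATTERN.search(line)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


# Usage from any machine with AWS credentials (not executed here):
#   import boto3
#   lines = fetch_training_log_lines(boto3.client("logs"), "Your job name here")
#   print(latest_progress(lines))
```

Passing the client in as an argument keeps the helpers independent of how credentials and regions are configured on each computer.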

Are you using built-in algorithms or a Framework like MXNet or TensorFlow? For TensorFlow you can monitor your job with TensorBoard.

Additionally, you can see the high-level job status using the DescribeTrainingJob API call:

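import sagemaker
# Reuse the low-level boto3 SageMaker client held by the SDK session.
sm_client = sagemaker.Session().sagemaker_client
# The response is a dict; 'TrainingJobStatus' and 'SecondaryStatus' describe the current state.
print(sm_client.describe_training_job(TrainingJobName='Your job name here'))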
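
The full response is fairly large, so it may help to condense it. The sketch below is one possible way to do that, using only fields that the DescribeTrainingJob response documents (`TrainingJobStatus`, `SecondaryStatus`, `SecondaryStatusTransitions`, `TrainingStartTime`, `TrainingEndTime`, `StoppingCondition`, `FinalMetricDataList`). The function name and the output keys are hypothetical. Note that `seconds_until_timeout` is only an upper bound derived from `MaxRuntimeInSeconds`, not an estimate of when training will actually finish.

```python
from datetime import datetime, timezone


def summarize_training_job(description, now=None):
    """Condense a describe_training_job response dict into a small status dict."""
    now = now or datetime.now(timezone.utc)
    start = description.get("TrainingStartTime")
    end = description.get("TrainingEndTime") or now  # still running: measure up to now
    elapsed = (end - start).total_seconds() if start else None
    max_runtime = description.get("StoppingCondition", {}).get("MaxRuntimeInSeconds")
    transitions = description.get("SecondaryStatusTransitions", [])

    remaining = None
    if max_runtime is not None and elapsed is not None:
        remaining = max_runtime - elapsed  # upper bound only, not a real ETA

    return {
        "status": description.get("TrainingJobStatus"),
        "secondary_status": description.get("SecondaryStatus"),
        "last_message": transitions[-1].get("StatusMessage") if transitions else None,
        "elapsed_seconds": elapsed,
        "seconds_until_timeout": remaining,
        # Metrics SageMaker reports for the job, if any are present.
        "metrics": {
            m["MetricName"]: m["Value"]
            for m in description.get("FinalMetricDataList", [])
        },
    }


# Usage with the client from the snippet above (not executed here):
#   desc = sm_client.describe_training_job(TrainingJobName='Your job name here')
#   print(summarize_training_job(desc))
```
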
Gili Nachum answered Mar 08 '26 01:03