I am using Amazon SageMaker to train a model with a lot of data. This takes a lot of time - hours or even days. During this time, I would like be able to query the trainer and see its current status, particularly:
One way to do this is to explicitly tell the trainer to print debug messages after each iteration. However, these messages will be availble only at the console from which I run the trainer. Since training takes so much time, I would like to be able to query the trainer status remotely, from different computers.
Is there a way to remotely query the status of a running trainer?
All logs are available in Amazon Cloudwatch. You can query CloudWatch programmatically or via an API to parse the logs.
Are you using built-in algorithms or a Framework like MXNet or TensorFlow? For TensorFlow you can monitor your job with TensorBoard.
Additionally, you can see high level job status using the describe training job API call:
import sagemaker
sm_client = sagemaker.Session().sagemaker_client
print(sm_client.describe_training_job(TrainingJobName='You job name here'))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With