We were running two TrainingJob instances of type (1) ml.p3.8xlarge and (2) ml.p3.2xlarge.
Each training job runs a custom algorithm with TensorFlow and a Keras backend.
Instance (1) runs fine, while instance (2), after a reported training time of 1 hour and without any logging in CloudWatch (no text in the log), exits with this error:
Failure reason
CapacityError: Unable to provision requested ML compute capacity. Please retry using a different ML instance type.
I'm not sure what this message means.
This message means SageMaker tried to launch the instance, but EC2 did not have enough capacity for that instance type, so after waiting for some time (in this case 1 hour) SageMaker gave up and failed the training job.
For more information about EC2 capacity issues, see: troubleshooting-launch-capacity
To solve this, you can either run the job with a different instance type, as suggested in the failure reason, or wait a few minutes and then submit your request again, as suggested by EC2.
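As a rough illustration, here is a minimal sketch (not an official pattern) using the SageMaker Python SDK that retries the job across fallback instance types when it fails with a CapacityError. The role ARN, image URI, S3 paths, and the list of candidate instance types are all placeholders you would replace with your own values:

```python
# Sketch: retry the same training job across fallback instance types.
# Assumes the SageMaker Python SDK (v2); all ARNs, URIs, and S3 paths are placeholders.
import time

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role_arn = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder

# Preferred instance type first, then fallbacks if capacity is unavailable.
candidate_instance_types = ["ml.p3.2xlarge", "ml.p3.8xlarge", "ml.g4dn.12xlarge"]

for instance_type in candidate_instance_types:
    estimator = Estimator(
        image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-tf-keras-algo:latest",  # placeholder
        role=role_arn,
        instance_count=1,
        instance_type=instance_type,
        output_path="s3://my-bucket/output/",  # placeholder
        sagemaker_session=session,
    )
    try:
        estimator.fit({"training": "s3://my-bucket/train/"})  # placeholder channel
        break  # the job completed, stop trying other types
    except Exception as err:  # fit() raises when the job ends in a Failed status
        if "CapacityError" in str(err):
            print(f"No capacity for {instance_type}, trying the next type...")
            time.sleep(300)  # back off briefly before retrying, per EC2's suggestion
            continue
        raise  # any other failure is a real error, re-raise it
```

The same idea works with the low-level boto3 `create_training_job` call: inspect the job's `FailureReason` and resubmit with a different `InstanceType` or after a delay.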