AWS SageMaker: CapacityError: Unable to provision requested ML compute capacity.

We were running two TrainingJob instances, of type (1) ml.p3.8xlarge and (2) ml.p3.2xlarge.

Each training job runs a custom algorithm with TensorFlow plus a Keras backend.

Instance (1) is running fine, while instance (2), after a reported training time of 1 hour and without any logging in CloudWatch (no text in the log), exits with this error:

Failure reason
CapacityError: Unable to provision requested ML compute capacity. Please retry using a different ML instance type.

I'm not sure what this message means.

loretoparisi asked Oct 28 '25 03:10


1 Answer

This message means that SageMaker tried to launch the instance, but EC2 did not have enough capacity for that instance type. After waiting for some time (in this case 1 hour), SageMaker gave up and failed the training job.

For more information about capacity issues on EC2, see the EC2 documentation: troubleshooting-launch-capacity

To solve this, you can either run the job with a different instance type, as the failure reason suggests, or wait a few minutes and submit your request again, as EC2 suggests.
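Both suggestions can be combined into a small retry loop. The following is only a sketch, not an official SageMaker pattern: `run_with_fallback`, `is_capacity_error`, and the `run_job` callable are hypothetical names made up for this example. `run_job` is assumed to submit a training job on the given instance type, block until it finishes, and return its final status and failure reason.

```python
import time
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical callable type: takes an instance type, returns (status, failure_reason).
RunJob = Callable[[str], Tuple[str, Optional[str]]]


def is_capacity_error(failure_reason: Optional[str]) -> bool:
    """True if the failure reason looks like the CapacityError from the question."""
    return bool(failure_reason) and failure_reason.startswith("CapacityError")


def run_with_fallback(
    run_job: RunJob,
    instance_types: Sequence[str],
    retries_per_type: int = 2,
    wait_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try each instance type in order, retrying on capacity errors.

    Returns the instance type the job completed on, or None if every
    attempt ended with a capacity error.
    """
    for instance_type in instance_types:
        for attempt in range(retries_per_type):
            status, reason = run_job(instance_type)
            if status == "Completed":
                return instance_type
            if not is_capacity_error(reason):
                # Any other failure is not a capacity problem, so do not retry.
                raise RuntimeError(f"Training job failed: {reason}")
            if attempt < retries_per_type - 1:
                # Wait before resubmitting on the same instance type.
                sleep(wait_seconds)
    return None
```

In practice, `run_job` could wrap boto3's `create_training_job` and then poll `describe_training_job`, reading `TrainingJobStatus` and `FailureReason` from the response. The retry count and wait time here are arbitrary placeholders.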

Harish Panwar answered Oct 30 '25 07:10