We were running two TrainingJob instances of type (1) ml.p3.8xlarge and (2) ml.p3.2xlarge.
Each training job runs a custom algorithm with TensorFlow and a Keras backend.
Instance (1) runs fine, while instance (2), after a reported training time of 1 hour and without any logging in CloudWatch (no text in the log), exits with this error:
Failure reason
CapacityError: Unable to provision requested ML compute capacity. Please retry using a different ML instance type.
I'm not sure what this message means.
This message means SageMaker tried to launch the instance, but EC2 did not have enough capacity for that instance type, so after waiting for some time (in this case 1 hour) SageMaker gave up and failed the training job.
For more information about EC2 capacity issues, see: troubleshooting-launch-capacity
To solve this, you can either run the job with a different instance type, as suggested in the failure reason, or wait a few minutes and then submit your request again, as suggested by EC2.
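As a rough illustration, here is a minimal sketch (not an official pattern) using the SageMaker Python SDK that retries the job across fallback instance types when it fails with a CapacityError. The role ARN, image URI, S3 paths, and the list of candidate instance types are all placeholders you would replace with your own values:

```python
# Sketch: retry the same training job across fallback instance types.
# Assumes the SageMaker Python SDK (v2); all ARNs, URIs, and S3 paths are placeholders.
import time

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role_arn = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder

# Preferred instance type first, then fallbacks if capacity is unavailable.
candidate_instance_types = ["ml.p3.2xlarge", "ml.p3.8xlarge", "ml.g4dn.12xlarge"]

for instance_type in candidate_instance_types:
    estimator = Estimator(
        image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-tf-keras-algo:latest",  # placeholder
        role=role_arn,
        instance_count=1,
        instance_type=instance_type,
        output_path="s3://my-bucket/output/",  # placeholder
        sagemaker_session=session,
    )
    try:
        estimator.fit({"training": "s3://my-bucket/train/"})  # placeholder channel
        break  # the job completed, stop trying other types
    except Exception as err:  # fit() raises when the job ends in a Failed status
        if "CapacityError" in str(err):
            print(f"No capacity for {instance_type}, trying the next type...")
            time.sleep(300)  # back off briefly before retrying, per EC2's suggestion
            continue
        raise  # any other failure is a real error, re-raise it
```

The same idea works with the low-level boto3 `create_training_job` call: inspect the job's `FailureReason` and resubmit with a different `InstanceType` or after a delay.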