I am creating an AWS ECS service using CloudFormation.
Everything seems to complete successfully: I can see the instance being attached to the load balancer, the load balancer reports the instance as healthy, and if I hit the load balancer I am taken to my running container.
Looking at the ECS control panel, I can see that the service has stabilised and that everything looks OK. I can also see that the container is stable and is not being terminated or re-created.
However, the CloudFormation stack never completes. It is stuck in CREATE_IN_PROGRESS until about 30-60 minutes later, when it rolls back claiming that the service did not stabilise. Looking at CloudTrail, I can see a number of RegisterInstancesWithLoadBalancer calls initiated by ecs-service-scheduler, all with the same parameters, i.e. the same instance ID and load balancer. I am using the standard IAM roles and permissions for ECS, so it should not be a permissions issue.
Has anyone had a similar issue?
Your AWS::ECS::Service needs to reference the full ARN of the TaskDefinition (source: the answer from ChrisB@AWS on the AWS forums). The key thing is to set TaskDefinition to the full ARN, including the revision. If you skip the revision (:123 in the example below), the latest revision is used, but CloudFormation still goes out to lunch with "CREATE_IN_PROGRESS" for about an hour before failing. Here's one way to do that:

    "MyService": {
      "Type": "AWS::ECS::Service",
      "Properties": {
        "Cluster": { "Ref": "ECSClusterArn" },
        "DesiredCount": 1,
        "LoadBalancers": [
          {
            "ContainerName": "myContainer",
            "ContainerPort": "80",
            "LoadBalancerName": "MyELBName"
          }
        ],
        "Role": { "Ref": "EcsElbServiceRoleArn" },
        "TaskDefinition": {
          "Fn::Join": ["", [
            "arn:aws:ecs:", { "Ref": "AWS::Region" }, ":", { "Ref": "AWS::AccountId" },
            ":task-definition/my-task-definition-name:123"
          ]]
        }
      }
    }
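Here's a nifty way to grab the latest revision number of MyTaskDefinition via the AWS CLI and jq. Sorting in descending order puts the newest revision first, and splitting the ARN on ":" keeps the whole revision number rather than only its last digit:

    aws ecs list-task-definitions --family-prefix MyTaskDefinition --sort DESC | jq --raw-output '.taskDefinitionArns[0] | split(":")[-1]'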
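If the ARNs are already at hand (for example, parsed from list-task-definitions output), the same lookup can be done in Python. This is only a minimal sketch: the helper names latest_revision and task_definition_arn are hypothetical, the account ID and revisions are made up, and the ARN format assumes the standard "aws" partition.

```python
from typing import Iterable


def latest_revision(arns: Iterable[str], family: str) -> int:
    """Return the highest revision number for `family` among task definition ARNs."""
    revisions = []
    for arn in arns:
        # The resource part after the last "/" looks like "<family>:<revision>".
        name, _, revision = arn.rpartition("/")[2].rpartition(":")
        # Compare revisions as integers so that 10 ranks above 9.
        if name == family and revision.isdigit():
            revisions.append(int(revision))
    if not revisions:
        raise ValueError(f"no revisions found for family {family!r}")
    return max(revisions)


def task_definition_arn(region: str, account_id: str, family: str, revision: int) -> str:
    """Build the full ARN, including the revision suffix (assumes the 'aws' partition)."""
    return f"arn:aws:ecs:{region}:{account_id}:task-definition/{family}:{revision}"


# Example input with a made-up account ID and made-up revisions.
arns = [
    "arn:aws:ecs:us-east-1:123456789012:task-definition/MyTaskDefinition:9",
    "arn:aws:ecs:us-east-1:123456789012:task-definition/MyTaskDefinition:10",
    "arn:aws:ecs:us-east-1:123456789012:task-definition/MyTaskDefinitionOther:50",
]
revision = latest_revision(arns, "MyTaskDefinition")
print(task_definition_arn("us-east-1", "123456789012", "MyTaskDefinition", revision))
```

The family name is matched exactly because a family prefix can also match other families that merely start with the same string. The resulting ARN could then be passed to the stack as a parameter instead of being hard-coded.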
I found another related scenario that will cause this and thought I'd put it here in case anyone else runs into it. If you define a TaskDefinition with an Image that doesn't actually exist in its ContainerDefinition and then you try to run that TaskDefinition as a Service, you'll run into the same hang issue (or at least something that looks like the same issue).
NOTE: The example YAML chunks below were all in the same CloudFormation template.
So as an example, I created this Repository:

    MyRepository:
      Type: AWS::ECR::Repository

And then I created this Cluster:

    MyCluster:
      Type: AWS::ECS::Cluster
And this TaskDefinition (abridged):

    MyECSTaskDefinition:
      Type: AWS::ECS::TaskDefinition
      Properties:
        # ...
        ContainerDefinitions:
          - # ...
            Image: !Join ["", [!Ref "AWS::AccountId", ".dkr.ecr.", !Ref "AWS::Region", ".amazonaws.com/", !Ref MyRepository, ":1"]]
            # ...
With those defined, I went to create a Service like this:

    MyECSServiceDefinition:
      Type: AWS::ECS::Service
      Properties:
        Cluster: !Ref MyCluster
        DesiredCount: 2
        PlacementStrategies:
          - Type: spread
            Field: attribute:ecs.availability-zone
        TaskDefinition: !Ref MyECSTaskDefinition
Which all seemed sensible to me, but it turns out there were two issues with this as written/deployed that caused it to hang:

1. DesiredCount is set to 2, which means it will actually try to spin up the service and run it, not just define it. If I set DesiredCount to 0, this works just fine.
2. The Image defined in MyECSTaskDefinition doesn't exist yet. I made the repository as part of this template, but I didn't actually push anything to it. So when MyECSServiceDefinition tried to spin up the DesiredCount of 2 instances, it hung because the image wasn't actually available in the repository (because the repository literally just got created in the same template).

So, for now, the solution is to create the CloudFormation stack with a DesiredCount of 0 for the Service, upload the appropriate Image to the repository, and then update the CloudFormation stack to scale up the service. Alternatively, have one template that sets up core infrastructure like the repository, upload builds to that, and then run a separate template that sets up the Services themselves.
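The "start at 0, scale up once the image exists" decision can also be scripted. The following is only a sketch under some assumptions: ecr_client is expected to be a boto3 ECR client (boto3.client("ecr")), the function names image_tag_exists and desired_count_for_deploy are hypothetical, and the resulting number would still have to be fed into the stack yourself, e.g. as a template parameter.

```python
def image_tag_exists(ecr_client, repository: str, tag: str) -> bool:
    """Return True if `repository` contains an image tagged `tag`.

    `ecr_client` is assumed to be a boto3 ECR client; describe_images raises
    ImageNotFoundException when the requested tag is not in the repository.
    """
    try:
        ecr_client.describe_images(
            repositoryName=repository,
            imageIds=[{"imageTag": tag}],
        )
    except ecr_client.exceptions.ImageNotFoundException:
        return False
    return True


def desired_count_for_deploy(ecr_client, repository: str, tag: str, target_count: int) -> int:
    """Pick the DesiredCount to deploy with.

    Returns 0 while the image is missing, so the service is only defined and
    not started, and `target_count` once the image has been pushed.
    """
    if image_tag_exists(ecr_client, repository, tag):
        return target_count
    return 0
```

On the first deploy (empty repository) this yields 0, so the stack can finish creating; after the image is pushed, re-running it yields the real count for the stack update.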
Hope that helps anyone having this issue!