I have an AWS ECS service whose deployment has been in progress for more than an hour. What can I do to get it to complete?
I have the following for the service deployment options:
Minimum healthy percent 100
Maximum percent 200
Deployments typically get stuck because the new tasks for that deployment fail to become healthy.
Here are a few places to look:
- Service events. These normally show any problems launching tasks, or tasks being stopped by the service.
- Stopped task status. When tasks do launch but are then stopped, or fail to launch because of an error, the reason is shown in the task's status.
- Task logs. Out of the box they go nowhere, so unless you configure logging you won't see any. On EC2 you can log in to the instance and read them with docker logs; otherwise you have to set a log configuration in the task definition, or on the Docker daemon when using EC2 instances. Note that if you configure the awslogs log driver, your task execution role (or, on EC2, the container instance role) must also allow logs:CreateLogStream and logs:PutLogEvents, otherwise no logs will show up.
- Some applications need a rather large amount of CPU or memory at startup, so if you provision the task with less than they require, they can be very slow to start or never finish starting. They then fail to become healthy within the ECS or load balancer health check grace period, and ECS stops them. You can verify this by checking the service's CPU and memory utilisation metrics.
- Occasionally none of the above gives you any insight even though the container does run. In that case I would run troubleshooting commands in the container using ECS Exec, or run the container locally. I typically check whether the process is listening on the port with netstat, look at process activity with strace or sysdig, and inspect environment variables, process output and files.
- Check the ECS troubleshooting page.
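
If you prefer to pull the service events and stopped-task reasons from a script rather than the console, something like the following rough boto3 sketch should work. The helper names (recent_event_messages, stopped_task_reasons, diagnose) are my own and purely illustrative, and it assumes boto3 is installed with credentials that can call ecs:DescribeServices, ecs:ListTasks and ecs:DescribeTasks.

```python
from typing import Any


def recent_event_messages(service: dict[str, Any], limit: int = 10) -> list[str]:
    """Return the newest `limit` service event messages, newest first."""
    # Sort explicitly by createdAt rather than relying on the API's ordering.
    events = sorted(service.get("events", []), key=lambda e: e["createdAt"], reverse=True)
    return [e["message"] for e in events[:limit]]


def stopped_task_reasons(tasks: list[dict[str, Any]]) -> list[str]:
    """Summarise why each stopped task ended, down to container exit codes."""
    lines = []
    for task in tasks:
        lines.append(f'{task["taskArn"]}: {task.get("stoppedReason", "no stoppedReason")}')
        for container in task.get("containers", []):
            lines.append(
                f'  {container["name"]}: exitCode={container.get("exitCode")} '
                f'reason={container.get("reason")}'
            )
    return lines


def diagnose(cluster: str, service_name: str) -> None:
    """Print recent service events and stopped-task reasons for one service."""
    import boto3  # imported here so the helpers above work without AWS access

    ecs = boto3.client("ecs")
    service = ecs.describe_services(cluster=cluster, services=[service_name])["services"][0]
    for message in recent_event_messages(service):
        print(message)

    # Only recently stopped tasks are still visible to the API.
    arns = ecs.list_tasks(
        cluster=cluster, serviceName=service_name, desiredStatus="STOPPED"
    )["taskArns"]
    if arns:
        tasks = ecs.describe_tasks(cluster=cluster, tasks=arns)["tasks"]
        print("\n".join(stopped_task_reasons(tasks)))
```

You would call it as diagnose("my-cluster", "my-service"), with your own cluster and service names in place of those placeholders. A stoppedReason mentioning failed health checks, or a container reason mentioning memory, usually points at the grace period or the resource sizing described above.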