I have a very simple ECS container which listens on port 5000 and writes out HelloWorld, plus the hostname of the instance it is running on. I want to deploy many of these containers using ECS and load-balance them, just to really learn more about how this works. It is working to a certain extent, but my health check is failing (timing out), which is causing the container tasks to be bounced up and down.
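For context, the app behaves roughly like this minimal Go sketch (my real app is not Go; this is just an equivalent to make the setup concrete, including the '/health' endpoint mentioned below):

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	hostname, _ := os.Hostname()

	// Main endpoint: returns HelloWorld plus the hostname, so you can
	// see which task answered when refreshing through the ALB.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "HelloWorld from %s\n", hostname)
	})

	// Health-check endpoint for the ALB target group to poll.
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	http.ListenAndServe(":5000", nil)
}
```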
Each private subnet has a route for 10.0.0.0/19, and a default route for 0.0.0.0/0 to the NAT instance in the public subnet in the same AZ.
Each public subnet has the same 10.0.0.0/19 route and a default route for 0.0.0.0/0 to the internet gateway.
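If it helps, this is roughly how I create those default routes with the aws-sdk-go EC2 API (the route table, NAT instance, and gateway IDs below are placeholders; the 10.0.0.0/19 route is the VPC's automatic local route, assuming that's the VPC CIDR, so only the defaults need creating):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// Private subnet: default route to the NAT instance in the same AZ.
	if _, err := svc.CreateRoute(&ec2.CreateRouteInput{
		RouteTableId:         aws.String("rtb-private-a"), // placeholder
		DestinationCidrBlock: aws.String("0.0.0.0/0"),
		InstanceId:           aws.String("i-nat-a"), // placeholder NAT instance
	}); err != nil {
		log.Fatal(err)
	}

	// Public subnet: default route to the internet gateway.
	if _, err := svc.CreateRoute(&ec2.CreateRouteInput{
		RouteTableId:         aws.String("rtb-public-a"), // placeholder
		DestinationCidrBlock: aws.String("0.0.0.0/0"),
		GatewayId:            aws.String("igw-0abc"), // placeholder IGW
	}); err != nil {
		log.Fatal(err)
	}
}
```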
My ECS instances are in a security group that allows egress to anywhere and ingress on the ephemeral port range 32768-65535 from the security group the ALB is in.
The ALB is in a security group that allows ingress on port 80 only, but egress to the security group my ECS instances are in on any port/protocol.
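That ingress rule covers the host ports ECS assigns under dynamic port mapping; with aws-sdk-go it looks roughly like this (the security group IDs are placeholders):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// Allow the ALB's security group to reach the dynamically mapped
	// host ports on the ECS container instances.
	_, err := svc.AuthorizeSecurityGroupIngress(&ec2.AuthorizeSecurityGroupIngressInput{
		GroupId: aws.String("sg-ecs-instances"), // placeholder
		IpPermissions: []*ec2.IpPermission{{
			IpProtocol: aws.String("tcp"),
			FromPort:   aws.Int64(32768),
			ToPort:     aws.Int64(65535),
			UserIdGroupPairs: []*ec2.UserIdGroupPair{{
				GroupId: aws.String("sg-alb"), // placeholder
			}},
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```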
When I bring all this up, it actually works: I can take the public DNS record of the ALB, refresh, and see responses coming back from my container app telling me the hostname. This is exactly what I want to achieve. However, it fails the health check, and the container is drained and replaced with another one that also fails the health check. This continues in a cycle; I have never seen a single successful health check.
I can view my application, correctly load-balanced across AZs and private subnets, on both its main URL '/' and its health-check URL '/health': through the load balancer, from the ECS instance itself, and by using the instance's private IP and port from another machine within the public subnet the ALB is in. The ECS health check just cannot reach this endpoint even once without timing out. What on earth could I be missing??
For anyone who follows: I managed to accidentally break the app in my container, and it started throwing a 500 error. Crucially, the health check reported this 500 error, so it was NOT a network timeout. That means that when the health check contacts the endpoint in my app, the response is not being handled properly. This appears to be a problem between Nancy (the API framework I was using) and Go, which sometimes reports "Client.Timeout exceeded while awaiting headers"; I am sure ECS is interpreting this as a network timeout. I'm going to tcpdump the network traffic, see what the health check sends and what Nancy responds with, and compare that to a container that works. Perhaps there is a Nancy fix, or maybe ECS needs to be less fussy.
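That error string is exactly what Go's net/http client produces when its timeout fires before response headers arrive, so you can reproduce what the health check sees without waiting on ECS by probing the endpoint the same way (the IP and port below are hypothetical: a container instance's private IP plus a dynamically mapped host port):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Short timeout on purpose: a response whose headers arrive too
	// late (or stall the read) fails with
	// "... (Client.Timeout exceeded while awaiting headers)".
	client := &http.Client{Timeout: 2 * time.Second}

	resp, err := client.Get("http://10.0.1.23:32768/health") // hypothetical task address
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.StatusCode)
}
```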
Edit:
I simply updated all the NuGet packages that my Nancy app was using to the latest available versions, and suddenly everything started working!