The EC2 instances in my AWS Auto Scaling group all terminate after 1-4 hours of running. The exact time varies, but when it happens, all of the instances in the group go down within minutes of each other.
The scaling history description for each instance is simply:
At 2016-08-26T05:21:04Z an instance was taken out of service in response to a EC2 health check indicating it has been terminated or stopped.
But I haven't added any health checks. And the EC2 status checks all pass for the life of the instance.
How do I determine what this "health check" failure actually means?
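For reference, this is roughly how I pull the full activity records with boto3, in case the Cause/Details fields hold more than the console shows (a sketch; the group name is a placeholder):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Dump recent scaling activities for the group; "batch-workers" is a placeholder name.
resp = autoscaling.describe_scaling_activities(
    AutoScalingGroupName="batch-workers",
    MaxRecords=20,
)
for activity in resp["Activities"]:
    print(activity["StartTime"], activity["StatusCode"])
    print(activity["Description"])
    print(activity.get("Cause", ""))
    print(activity.get("Details", ""))
```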
Most questions about ASG terminations lead back to the load balancer, but I have no load balancer. This cluster processes batch jobs, and the min/max/desired values are set by software elsewhere in the system based on the workload backlog.
The ASG history does not indicate a scale-in event, and the instances are all explicitly protected from scale-in anyway.
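To double-check that, I dumped what the ASG itself reports for each instance (another boto3 sketch; again the group name is a placeholder):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Show the ASG's own view of each instance: lifecycle state, health status,
# and whether scale-in protection is actually applied.
resp = autoscaling.describe_auto_scaling_instances()
for inst in resp["AutoScalingInstances"]:
    if inst["AutoScalingGroupName"] != "batch-workers":  # placeholder group name
        continue
    print(inst["InstanceId"], inst["LifecycleState"],
          inst["HealthStatus"], inst["ProtectedFromScaleIn"])
```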
I tried setting the health check grace period to 20 hours to see if that would at least leave the instances up long enough to inspect, but they all still terminate.
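This is roughly the call I used to raise the grace period (boto3 sketch; group name is a placeholder):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# 20 hours, expressed in seconds; "batch-workers" is a placeholder group name.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="batch-workers",
    HealthCheckGracePeriod=20 * 60 * 60,
)
```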
The instances are running an ECS AMI, and ECS is running a single task, started at bootup, in a container. The logs from that task look normal, and things seem to be running happily until a few minutes before the instance vanishes.
The task is CPU intensive, but the problem still occurs even when I just have it sleep for six hours.
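In case it helps, this is roughly how I check what ECS itself reports for the container instances before they disappear (a boto3 sketch; the cluster name is a placeholder):

```python
import boto3

ecs = boto3.client("ecs")
cluster = "batch-cluster"  # placeholder cluster name

# List the cluster's container instances and print what ECS reports for each:
# instance ID, status, agent connectivity, and running task count.
arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
if arns:
    resp = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
    for ci in resp["containerInstances"]:
        print(ci["ec2InstanceId"], ci["status"],
              ci["agentConnected"], ci["runningTasksCount"])
```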
Here are a few suggestions:
In the Target Groups section, verify the Health checks settings and the Targets (the registered targets and their status, and the health of the Availability Zones).
To modify health check settings for a target group using the AWS Console, choose Target Groups, and edit Health checks.
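If you do have a target group, a boto3 sketch along these lines shows each target's health state and reason, and modify_target_group can adjust the health check settings (the target group ARN and the values here are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN; replace with your target group's ARN.
tg_arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/0123456789abcdef"

# Show each registered target's health state and the reason it is unhealthy.
health = elbv2.describe_target_health(TargetGroupArn=tg_arn)
for desc in health["TargetHealthDescriptions"]:
    print(desc["Target"]["Id"],
          desc["TargetHealth"]["State"],
          desc["TargetHealth"].get("Reason"),
          desc["TargetHealth"].get("Description"))

# Adjust the health check settings themselves (values are examples only).
elbv2.modify_target_group(
    TargetGroupArn=tg_arn,
    HealthCheckPath="/health",
    HealthCheckIntervalSeconds=30,
    HealthyThresholdCount=3,
    UnhealthyThresholdCount=3,
)
```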
In the ASG (EC2 Auto Scaling group), check Details (for the termination policies), Activity History (for the termination messages), Instances (for their health status), Scheduled Actions, and Scaling Policies.
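For the ASG side, a boto3 sketch along these lines dumps the group-level settings mentioned above (the group name is a placeholder):

```python
import boto3

autoscaling = boto3.client("autoscaling")
asg_name = "your-asg-name"  # placeholder

# Group-level settings: health check type/grace period, termination policies,
# and any suspended scaling processes.
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[asg_name]
)["AutoScalingGroups"][0]
print("HealthCheckType:", group["HealthCheckType"])
print("HealthCheckGracePeriod:", group["HealthCheckGracePeriod"])
print("TerminationPolicies:", group["TerminationPolicies"])
print("SuspendedProcesses:", group["SuspendedProcesses"])

# Scaling policies and scheduled actions attached to the group.
print(autoscaling.describe_policies(AutoScalingGroupName=asg_name)["ScalingPolicies"])
print(autoscaling.describe_scheduled_actions(AutoScalingGroupName=asg_name)["ScheduledUpdateGroupActions"])
```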