The EC2 instances in my AWS Auto Scaling group all terminate after 1-4 hours of running. The exact time varies, but when it happens, all of the instances in the group go down within minutes of each other.
The scaling history description for each instance is simply:
At 2016-08-26T05:21:04Z an instance was taken out of service in response to a EC2 health check indicating it has been terminated or stopped.
But I haven't added any health checks. And the EC2 status checks all pass for the life of the instance.
How do I determine what this "health check" failure actually means?
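For reference, this is roughly how I pull the full activity records with boto3, in case the Cause/Details fields hold more than the console shows (a sketch; the group name is a placeholder):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Dump recent scaling activities for the group; "batch-workers" is a placeholder name.
resp = autoscaling.describe_scaling_activities(
    AutoScalingGroupName="batch-workers",
    MaxRecords=20,
)
for activity in resp["Activities"]:
    print(activity["StartTime"], activity["StatusCode"])
    print(activity["Description"])
    print(activity.get("Cause", ""))
    print(activity.get("Details", ""))
```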
Most questions about ASG terminations lead back to the load balancer, but I have no load balancer. This cluster processes batch jobs, and the min/max/desired values are set by software elsewhere in the system based on the workload backlog.
The ASG history does not indicate a scale-in event, and the instances are all explicitly protected from scale-in anyway.
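To double-check that, I dumped what the ASG itself reports for each instance (another boto3 sketch; again the group name is a placeholder):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Show the ASG's own view of each instance: lifecycle state, health status,
# and whether scale-in protection is actually applied.
resp = autoscaling.describe_auto_scaling_instances()
for inst in resp["AutoScalingInstances"]:
    if inst["AutoScalingGroupName"] != "batch-workers":  # placeholder group name
        continue
    print(inst["InstanceId"], inst["LifecycleState"],
          inst["HealthStatus"], inst["ProtectedFromScaleIn"])
```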
I tried setting the health check grace period to 20 hours to see if that would at least leave the instances up long enough to inspect, but they all still terminate.
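This is roughly the call I used to raise the grace period (boto3 sketch; group name is a placeholder):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# 20 hours, expressed in seconds; "batch-workers" is a placeholder group name.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="batch-workers",
    HealthCheckGracePeriod=20 * 60 * 60,
)
```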
The instances are running an ECS AMI, and ECS is running a single task, started at bootup, in a container. The logs from that task look normal, and things seem to be running happily until a few minutes before the instance vanishes.
The task is CPU intensive, but the problem still occurs even when I just have it sleep for six hours.
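In case it helps, this is roughly how I check what ECS itself reports for the container instances before they disappear (a boto3 sketch; the cluster name is a placeholder):

```python
import boto3

ecs = boto3.client("ecs")
cluster = "batch-cluster"  # placeholder cluster name

# List the cluster's container instances and print what ECS reports for each:
# instance ID, status, agent connectivity, and running task count.
arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
if arns:
    resp = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
    for ci in resp["containerInstances"]:
        print(ci["ec2InstanceId"], ci["status"],
              ci["agentConnected"], ci["runningTasksCount"])
```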
Here are a few suggestions:
In the Target Groups section, verify the Health checks settings and the Targets (the registered targets and their status, and the health of the Availability Zones).
To modify health check settings for a target group using the AWS Console, choose Target Groups, and edit Health checks.
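If you do have a target group, a boto3 sketch along these lines shows each target's health state and reason, and modify_target_group can adjust the health check settings (the target group ARN and the values here are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN; replace with your target group's ARN.
tg_arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/0123456789abcdef"

# Show each registered target's health state and the reason it is unhealthy.
health = elbv2.describe_target_health(TargetGroupArn=tg_arn)
for desc in health["TargetHealthDescriptions"]:
    print(desc["Target"]["Id"],
          desc["TargetHealth"]["State"],
          desc["TargetHealth"].get("Reason"),
          desc["TargetHealth"].get("Description"))

# Adjust the health check settings themselves (values are examples only).
elbv2.modify_target_group(
    TargetGroupArn=tg_arn,
    HealthCheckPath="/health",
    HealthCheckIntervalSeconds=30,
    HealthyThresholdCount=3,
    UnhealthyThresholdCount=3,
)
```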
In the ASG (EC2 Auto Scaling group), check Details (for the termination policies), Activity History (for the termination messages), Instances (for their health status), Scheduled Actions, and Scaling Policies.
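For the ASG side, a boto3 sketch along these lines dumps the group-level settings mentioned above (the group name is a placeholder):

```python
import boto3

autoscaling = boto3.client("autoscaling")
asg_name = "your-asg-name"  # placeholder

# Group-level settings: health check type/grace period, termination policies,
# and any suspended scaling processes.
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[asg_name]
)["AutoScalingGroups"][0]
print("HealthCheckType:", group["HealthCheckType"])
print("HealthCheckGracePeriod:", group["HealthCheckGracePeriod"])
print("TerminationPolicies:", group["TerminationPolicies"])
print("SuspendedProcesses:", group["SuspendedProcesses"])

# Scaling policies and scheduled actions attached to the group.
print(autoscaling.describe_policies(AutoScalingGroupName=asg_name)["ScalingPolicies"])
print(autoscaling.describe_scheduled_actions(AutoScalingGroupName=asg_name)["ScheduledUpdateGroupActions"])
```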