I have an auto-scaling group which launches queue processing instances. These instances are Windows based. Normally we just need one, but when our backlog grows too large I want to automatically launch more to deal with the load, so our users have a good experience. Right now the desired number of nodes is set manually, but I would like to automate this with a CloudWatch alarm in the future.
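(For context, the eventual automation I have in mind is roughly the sketch below: a simple scaling policy tied to a CloudWatch alarm on SQS queue depth. The queue name, threshold, and policy name are all made up.)

PS> $policyArn = Write-ASScalingPolicy -AutoScalingGroupName "Production Background Server" `
        -PolicyName "ScaleOutOnBacklog" -AdjustmentType "ChangeInCapacity" -ScalingAdjustment 1;
PS> $dimension = New-Object Amazon.CloudWatch.Model.Dimension;
PS> $dimension.Name = "QueueName"; $dimension.Value = "my-backlog-queue";  # placeholder queue
PS> Write-CWMetricAlarm -AlarmName "BacklogTooLarge" -Namespace "AWS/SQS" `
        -MetricName "ApproximateNumberOfMessagesVisible" -Statistic "Average" `
        -Period 300 -EvaluationPeriods 1 -Threshold 500 `
        -ComparisonOperator "GreaterThanThreshold" -Dimension $dimension `
        -AlarmAction $policyArn;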
When a new instance is requested, it downloads its configuration from Chef and launches successfully; I know this from the logs, which show a successful Chef run. It joins the other instances and begins consuming messages from the queue. However, about 10 minutes after it is launched, it is terminated because the instance "failed to launch" due to a heartbeat timeout. The group then attempts to launch a new instance and the cycle continues.
When the instance launches, it is stuck in the "Pending:Wait" state. Unlike my web server auto-scaling group, it never leaves this state until it is terminated later. The two instances are roughly the same; the only difference is that this one doesn't run a web server.
I have tried adjusting the health check grace period and the cooldown period to 1500 seconds, but the instance is always terminated within 10 minutes (sometimes 11). I also tried adding "HealthCheck" and "AddToLoadBalancer" to the list of suspended processes, but this did not appear to have any effect.
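(These adjustments amounted to something like the following, using the group name from the listing further down.)

PS> Update-ASAutoScalingGroup -AutoScalingGroupName "Production Background Server" `
        -HealthCheckGracePeriod 1500 -DefaultCooldown 1500;
PS> Suspend-ASProcess -AutoScalingGroupName "Production Background Server" `
        -ScalingProcess "HealthCheck","AddToLoadBalancer";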
I have also tried manually setting the health of the instance using Set-ASInstanceHealth (or aws autoscaling set-instance-health for those who know the CLI version). This had no effect either.
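(The call looked roughly like this; the instance ID is a placeholder.)

PS> Set-ASInstanceHealth -InstanceId "i-xxxxxxxx" -HealthStatus "Healthy" `
        -ShouldRespectGracePeriod $false;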
I do have one instance that was launched by the autoscaling group, so at some point it was able to launch instances. I assume the issue lies with the heartbeat, but I do not understand what is supposed to send it, and I cannot find any documentation on this.
My guess is that somewhere there is a flag I need to set when the instance has finished launching and the software on it is configured correctly. Instances that associate with an ELB already have this because they have a functioning web server, but instances that don't listen on any ports need something extra. This is the only difference I can see between this group and my other autoscaling groups.
Update 17th September 2017: you can now see lifecycle hooks in the management console, so you don't need to use the API calls below if you don't want to.
I have successfully resolved the issue, with the assistance of some Amazon employees on the AWS forums. Unfortunately, as I was not aware of the root cause at the time, I was unable to include in the question some details that would have helped someone resolve it.
The issue was that I had two lifecycle hooks defined for the autoscaling group. These hooks are responsible for triggering a CodeDeploy deployment when a new instance launches. Once the deployment is successful, the hook is marked as succeeded and the instance is moved to the "InService" state. If the hook is never marked as succeeded, the autoscaling system decides that the instance failed to launch and terminates it.
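(For what it's worth, "marking the hook as succeeded" corresponds to completing the lifecycle action. CodeDeploy normally does this itself after a successful deployment, but done by hand it would look roughly like this; the hook name is truncated and the instance ID is a placeholder.)

PS> Complete-ASLifecycleAction -AutoScalingGroupName "Production Background Server" `
        -LifecycleHookName "CodeDeploy-managed-automatic-launch-deployment-hook-Production-..." `
        -InstanceId "i-xxxxxxxx" -LifecycleActionResult "CONTINUE";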
I used the PowerShell API to list my lifecycle hooks:
PS> Get-ASLifecycleHooks -AutoScalingGroupName "Production Background Server";
AutoScalingGroupName : Production Background Server
DefaultResult : CONTINUE
GlobalTimeout : 150000
HeartbeatTimeout : 1500
LifecycleHookName : CodeDeploy-managed-automatic-launch-deployment-hook-Production-cdf28f52-84dc-48ca-9c25-xxxxxxxxxxxx
LifecycleTransition : autoscaling:EC2_INSTANCE_LAUNCHING
NotificationMetadata : 03ff305d-be5e-48a8-bc85-xxxxxxxxxxxxx
NotificationTargetARN : arn:aws:sqs:eu-west-1:xxxxxxxxxxxxxx:razorbill-eu-west-1-prod-default-autoscaling-lifecycle-hook
RoleARN :

AutoScalingGroupName : Production Background Server
DefaultResult : CONTINUE
GlobalTimeout : 150000
HeartbeatTimeout : 1500
LifecycleHookName : CodeDeploy-managed-automatic-launch-deployment-hook-Production-f6bda6f3-d4f3-4a73-a6ca-xxxxxxxxxxxxx
LifecycleTransition : autoscaling:EC2_INSTANCE_LAUNCHING
NotificationMetadata : 03ff305d-be5e-48a8-bc85-xxxxxxxxxxxx
NotificationTargetARN : arn:aws:sqs:eu-west-1:xxxxxxxxxxxxxx:razorbill-eu-west-1-prod-default-autoscaling-lifecycle-hook
RoleARN :
I saw that I had two hooks with the same notification metadata. I assumed one must be redundant and I removed one. The next launch I attempted succeeded.
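(Removing a hook is a single call; for example, to delete the second hook from the listing above:)

PS> Remove-ASLifecycleHook -AutoScalingGroupName "Production Background Server" `
        -LifecycleHookName "CodeDeploy-managed-automatic-launch-deployment-hook-Production-f6bda6f3-d4f3-4a73-a6ca-xxxxxxxxxxxxx" `
        -Force;  # -Force skips the confirmation prompt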
My theory is that because both hooks had the same notification metadata, it was not possible for both hooks to be marked as succeeded. Therefore, one of the two would always time out, causing a heartbeat timeout.
I have no idea how this extra hook was defined, but I assume that this is a bug in the CodeDeploy user interface. Anyway, I'm glad that this issue is now resolved.