
AWS AutoScalingGroup HealthCheckType 'ELB' considers instance "InService" prematurely

I'm trying to get AutoScalingRollingUpdate to work on my autoscaling group, by bringing online new instances, then only once the new instance(s) are accepting traffic, terminating the old instances. It seems like AutoScalingRollingUpdate is designed for this purpose.

I have the HealthCheckType of my AutoScalingGroup set to 'ELB'. I also have the HealthCheck on the ELB set to require:

  • 3 successful requests to / for "healthy"
  • 10 unsuccessful requests to / for "unhealthy"
  • no grace period (zero, 0)

Now, from the ELB's perspective, when new instances come online, they are not InService for several minutes, which is what I expect. However, from the AutoScalingGroup's perspective, they are almost immediately being considered InService, and as such, my AutoScalingGroup is taking healthy instances out of service before the new instances are actually ready to receive traffic. I'm confused why the ASG thinks the instances are healthy before the ELB does, when the HealthCheckType is explicitly set to 'ELB'.

I've tried setting a grace period, but this doesn't change anything at all. In fact, I removed the grace period of 300 seconds because I thought maybe instances were implicitly "InService" during the grace period or something.

I know I can set a PauseTime on the rolling update policy, but that is fragile: instances sometimes fail while coming online and get nuked and replaced before they ever finish provisioning, so the PauseTime window may be exceeded. I'd also like to minimize the amount of time my app is running two different versions at once.

    ... ELB stuff ...

    "HealthCheck": {
      "HealthyThreshold": "3",
      "UnhealthyThreshold": "10",
      "Interval": "30",
      "Timeout": "15",
      "Target": {
        "Fn::Join": [
          "",
          [
            {"Fn::Join": [":", ["HTTP", {"Ref": "hostPort"}]]},
            {"Ref": "healthCheckPath"}
          ]
        ]
      }
    },

   ... ASG Stuff ...

  {
    ... snip ...

    "HealthCheckType": "ELB",
    "HealthCheckGracePeriod": "0",
    "Cooldown": "300"
  },
  "UpdatePolicy" : {
    "AutoScalingRollingUpdate" : {
      "MinInstancesInService" : "1",
      "MaxBatchSize" : "1"
    }
  }
d11wtq asked Nov 25 '14 07:11


1 Answer

First, in our experience with CloudFormation, the ASG HealthCheckType and HealthCheckGracePeriod properties matter mostly outside the scope of CloudFormation events. They come into play any time a new instance is added to the ASG. That can be during a CloudFormation update, but also during an Auto Scaling event or a self-healing event (an unhealthy instance being replaced). In the latter cases it is important to set the HealthCheckGracePeriod to a value that gives the new instance enough time to come online before the ELB health checks are considered.

It seems the capability you are most interested in is the UpdatePolicy, which is invoked when you run a CloudFormation update with a modified Launch Configuration. The magic property is WaitOnResourceSignals, which forces CloudFormation to wait for a success signal from each new instance before it considers that part of the update a success and moves on.

  "UpdatePolicy" : {
    "AutoScalingRollingUpdate" : {
      "MinInstancesInService" : "1",
      "MaxBatchSize" : "1",
      "PauseTime" : "PT15M",
      "WaitOnResourceSignals" : "true"
    }
  },

When the WaitOnResourceSignals property is set to true, the PauseTime property becomes a timeout. If CloudFormation does not receive a signal within the PauseTime of 15 minutes, the update is considered a failure and the new instance is terminated. Once a success signal is received, the ASG health check comes into play, but only after the HealthCheckGracePeriod has expired. We typically set the HealthCheckGracePeriod to the same value as the PauseTime. This ensures that we never begin using the ELB health check before the instance has had a chance to send a signal or reach the PauseTime timeout.

http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html
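
One detail when matching the two values: PauseTime is written as an ISO 8601 duration (PT15M), while HealthCheckGracePeriod is a number of seconds, so PT15M corresponds to 900. A minimal bash sketch of that conversion, assuming only the PT#H#M#S form is used; the function name pause_time_to_seconds is made up for illustration and is not part of any AWS tool:

```shell
#!/bin/bash
# Sketch: convert a PauseTime-style duration (PT#H#M#S) into seconds.
# pause_time_to_seconds is a hypothetical helper name, not an AWS command.
pause_time_to_seconds() {
  local d=$1
  local re='^PT(([0-9]+)H)?(([0-9]+)M)?(([0-9]+)S)?$'
  if [[ $d =~ $re ]]; then
    # Missing components default to 0; 10# forces base 10 for values like 05.
    local h=${BASH_REMATCH[2]:-0}
    local m=${BASH_REMATCH[4]:-0}
    local s=${BASH_REMATCH[6]:-0}
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
  else
    echo "unsupported duration: $d" >&2
    return 1
  fi
}
```

For the policy above, `pause_time_to_seconds PT15M` gives the value to use for HealthCheckGracePeriod.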

Typically, a success signal is sent to the ASG following the cfn-init bootstrapping script from within the UserData of the ASG Launch Configuration.

"UserData"       : { "Fn::Base64" : { "Fn::Join" : ["", [
     "#!/bin/bash -xe\n",
     "yum update -y aws-cfn-bootstrap\n",

     "/opt/aws/bin/cfn-init -v ",
     "         --stack ", { "Ref" : "AWS::StackName" },
     "         --resource LaunchConfig ",
     "         --configsets full_install ",
     "         --region ", { "Ref" : "AWS::Region" }, "\n",

     "/opt/aws/bin/cfn-signal -e $? ",
     "         --stack ", { "Ref" : "AWS::StackName" },
     "         --resource WebServerGroup ",
     "         --region ", { "Ref" : "AWS::Region" }, "\n"
]]}}

This is sufficient for many cases, but sometimes the instance may still not be ready when we send the success signal back to the ASG. For example, we may want to wait on a background process to load data or wait for our application server to start. This is especially true if our ELB health check targets a URL that requires our application to be running. In these cases we want to delay the success signal until our instance is ready. Here is an example of how to create a Launch Configuration configSet to delay the signal until the ELB API returns an "InService" status for the instance.

  "verify_instance_health" : {
    "commands" : {
      "ELBHealthCheck" : {
        "command" : { "Fn::Join" : ["", [ 
          "until [ \"$state\" = \"\\\"InService\\\"\" ]; do ",
          "  state=$(aws --region ", { "Ref" : "AWS::Region" }, " elb describe-instance-health ",
          "              --load-balancer-name ", { "Ref" : "ElasticLoadBalancer" }, 
          "              --instances $(curl -s http://169.254.169.254/latest/meta-data/instance-id) ",
          "              --query InstanceStates[0].State); ",
          "  sleep 10; ",
          "done"
        ]]}
      }
    }
  }
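
The loop above waits indefinitely, so a broken instance is only caught when the PauseTime expires. A possible variant, sketched here as a standalone bash script rather than a cfn-init command, gives up after its own deadline so that cfn-signal can report the failure right away. The names get_elb_state, wait_until_in_service, LB_NAME and REGION, and the default timings, are assumptions for illustration; the sketch also assumes the instance metadata endpoint answers without a session token, as in the original example.

```shell
#!/bin/bash
# Sketch: poll the ELB until this instance is InService or a deadline passes.
# LB_NAME and REGION are assumed to be set by the caller.

get_elb_state() {
  # Print this instance's state as seen by the Classic ELB, e.g. InService.
  local instance_id
  instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
  aws --region "$REGION" elb describe-instance-health \
      --load-balancer-name "$LB_NAME" \
      --instances "$instance_id" \
      --query 'InstanceStates[0].State' --output text
}

wait_until_in_service() {
  # $1: seconds to wait in total (default 840), $2: seconds between polls (default 10)
  local timeout=${1:-840} interval=${2:-10} waited=0
  until [ "$(get_elb_state)" = "InService" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1   # deadline reached while still not InService
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}
```

Running `wait_until_in_service 840 10` just before `cfn-signal -e $?` would then send a failure signal once the deadline passes; 840 seconds is simply a value chosen to sit below the 15-minute PauseTime.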

See this discussion forum for more information and a complete example using the ELB health check - https://forums.aws.amazon.com/ann.jspa?annID=2741

Note: These examples also require that you use the ASG CreationPolicy attribute to receive the signals during ASG creation. In the past, the WaitCondition and WaitConditionHandle resources were used to receive signals, but these are no longer recommended. The Count attribute is the number of signals that should be received at creation. This value should equal the ASG MinSize number.

  "CreationPolicy" : {
    "ResourceSignal" : {
      "Timeout" : "PT15M",
      "Count"   : "2"
    }
  },

http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-creationpolicy.html

Jason answered Oct 18 '22 16:10
