
AWS AutoScaling 'oldestinstance' Termination Policy does not always terminate oldest instances

Scenario

I am creating a script that will launch new instances into an AutoScaling Group and then remove the old instances. The purpose is to introduce newly created (or updated) AMIs to the AutoScaling Group. This is accomplished by doubling the group's Desired capacity. Then, after the new instances are Running, the Desired capacity is decreased by the number of instances that were added.
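The double-then-shrink procedure described above could be sketched roughly as follows. This is not the original script; it is a minimal illustration that assumes a boto3-style Auto Scaling client is passed in as `client`. The function name `roll_group`, the polling interval, and the timeout are hypothetical, and the sketch treats the group's `InService` lifecycle state as the equivalent of "Running".

```python
import time


def roll_group(client, group_name, poll_seconds=15, timeout_seconds=900):
    """Double the desired capacity, wait for the new instances, then shrink back.

    `client` is assumed to behave like boto3.client("autoscaling").
    Returns (original_capacity, doubled_capacity).
    """

    def describe():
        # Fetch the group and count the instances that are fully in service.
        group = client.describe_auto_scaling_groups(
            AutoScalingGroupNames=[group_name]
        )["AutoScalingGroups"][0]
        in_service = sum(
            1 for inst in group["Instances"]
            if inst["LifecycleState"] == "InService"
        )
        return group, in_service

    group, _ = describe()
    original = group["DesiredCapacity"]
    doubled = original * 2
    if doubled > group["MaxSize"]:
        raise ValueError("MaxSize is too small to double the group")

    # Step 1: scale out so the new AMI's instances are launched.
    client.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=doubled,
        HonorCooldown=False,
    )

    # Step 2: poll until every instance, old and new, is in service.
    deadline = time.monotonic() + timeout_seconds
    while True:
        _, in_service = describe()
        if in_service >= doubled:
            break
        if time.monotonic() > deadline:
            raise TimeoutError("new instances did not reach InService in time")
        time.sleep(poll_seconds)

    # Step 3: scale in; which instances go is up to the termination policy.
    client.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=original,
        HonorCooldown=False,
    )
    return original, doubled
```

With a real client this might be called as `roll_group(boto3.client("autoscaling"), "my-group")`, where `my-group` is a placeholder name. Step 3 is the point at which the behavior described in this question shows up, because the script does not choose which instances are terminated.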

Problem

When I run the script, I watch the group capacity double, the new instances come online and reach the Running state, and then the group capacity is decreased. Works like a charm. The problem is that SOMETIMES the instances terminated by the decrease are the new ones instead of the older ones.

Question

How can I ensure that the AutoScaling Group will always terminate the Oldest Instance?

Settings

  • The AutoScaling Group has the following Termination Policies: OldestInstance, OldestLaunchConfiguration. The Default policy has been removed.
  • The Default Cooldown is set to 0 seconds.
  • The Group only has one Availability Zone.

Troubleshooting

  • I played around with the Cooldown setting. Ended up just putting it on 0.
  • I waited different lengths of time to see if the existing servers needed to be running for a certain amount of time before they would be terminated. It seems that if they are less than 5 minutes old, they are less likely to be terminated, but not always. I had servers that were 20 minutes old survive while the new ones were terminated. Perhaps newly launched instances have some termination protection grace period?

Concession

I know that in most cases, the servers I will be replacing will have been running for a long time. In production, this might not be an issue. Still, it is possible that during the normal course of AutoScaling, an older server will be left running instead of a newer one. This is not an acceptable way to operate.

I could force specific instances to terminate, but that would defeat the point of the OldestInstance Termination Policy.
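For reference, the forced-termination workaround mentioned above might look like the sketch below. It assumes a boto3-style Auto Scaling client passed in as `client`, and a list of instance records that already carry `InstanceId` and `LaunchTime` (for example, gathered from an EC2 `describe_instances` call). The helper name `terminate_oldest` is hypothetical.

```python
def terminate_oldest(client, instances, count):
    """Terminate the `count` oldest instances and shrink the group to match.

    `instances` is a list of dicts with 'InstanceId' and 'LaunchTime' keys.
    `client` is assumed to behave like boto3.client("autoscaling").
    Returns the IDs that were terminated, oldest first.
    """
    # Sort by launch time so the oldest instances come first.
    oldest = sorted(instances, key=lambda inst: inst["LaunchTime"])[:count]
    for inst in oldest:
        # Decrementing the desired capacity prevents the group from
        # launching a replacement for the instance being removed.
        client.terminate_instance_in_auto_scaling_group(
            InstanceId=inst["InstanceId"],
            ShouldDecrementDesiredCapacity=True,
        )
    return [inst["InstanceId"] for inst in oldest]
```

This bypasses the termination policy entirely, which is exactly the drawback noted above: the script, not the group, decides which instances go.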

Update: 12 Feb 2014 I have continued to see this in production. Instances with older launch configs that have been running for weeks are left running while newer instances are terminated. At this point I am considering this to be a bug. A thread at Amazon was opened on this topic a couple of years ago, apparently without resolution.

Update: 21 Feb 2014 I have been working with AWS support staff and at this point they have preliminarily confirmed it could be a bug. They are researching the problem.

SunSparc asked Oct 20 '22 16:10



1 Answer

It doesn't look like you can, at least not precisely, because Auto Scaling is trying to do one other thing for you in addition to keeping the correct number of instances running: it keeps your instance counts balanced across availability zones, and it gives this consideration higher priority than your termination policy.

Before Auto Scaling selects an instance to terminate, it first identifies the Availability Zone that has more instances than the other Availability Zones used by the group. If all Availability Zones have the same number of instances, it identifies a random Availability Zone. Within the identified Availability Zone, Auto Scaling uses the termination policy to select the instance for termination.

— http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/us-termination-policy.html
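The selection order described in the quoted documentation can be modeled with a short sketch. This is an illustration of the documented behavior only, not AWS's actual implementation; the function name `pick_instance_to_terminate` and the record fields `id`, `az`, and `launch_time` are hypothetical.

```python
import random
from collections import Counter


def pick_instance_to_terminate(instances, rng=random):
    """Model the documented order: pick an Availability Zone, then the oldest instance in it.

    `instances` is a list of dicts with 'id', 'az', and 'launch_time' keys.
    """
    if not instances:
        raise ValueError("group has no instances")

    # Step 1: find the zone(s) holding the most instances.
    counts = Counter(inst["az"] for inst in instances)
    most = max(counts.values())
    fullest = sorted(az for az, n in counts.items() if n == most)

    # Step 2: if several zones tie, one of them is chosen at random.
    az = rng.choice(fullest)

    # Step 3: apply the OldestInstance policy inside that zone only.
    in_zone = [inst for inst in instances if inst["az"] == az]
    return min(in_zone, key=lambda inst: inst["launch_time"])


group = [
    {"id": "i-old", "az": "us-east-1a", "launch_time": 1},
    {"id": "i-mid", "az": "us-east-1b", "launch_time": 5},
    {"id": "i-new", "az": "us-east-1b", "launch_time": 7},
]
chosen = pick_instance_to_terminate(group)
```

In this example `chosen` is `i-mid`: zone `us-east-1b` holds more instances, so the oldest instance in that zone is selected even though `i-old` is older overall. In a group with a single Availability Zone, this model always selects the oldest instance.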

If you're out of balance, then getting back into balance is arguably the most sensible strategy, especially if you are using ELB. The documentation is a little ambiguous, but ELB will advertise one public IP in the DNS for each availability zone where it is configured; these IP addresses achieve the first tier of load balancing by virtue of round-robin DNS. If all of the availability zones where the ELB is enabled have healthy instances, then there appears to be a 1:1 correlation between which external IP the traffic hits and which availability zone's servers that traffic will be offered to by ELB. At least, that is what my server logs show. It appears that ELB doesn't route traffic across availability zones to alternate servers unless all of the servers in a given zone are detected as unhealthy, and that may be one of the reasons they've implemented autoscaling this way.

Although this algorithm might not always kill the oldest instance first on a region-wide basis, if it operates as documented, it will kill off the oldest one in the selected availability zone, and it should end up cycling through all of the zones over the course of several shifts in load, so it would not leave the oldest instance running indefinitely either. The larger the group, the less significant this effect should be.

Michael - sqlbot answered Oct 23 '22 15:10
