
Automatic recovery from an availability zone outage?

Are there any tools or techniques available to automatically create new instances in a different availability zone in the event that an availability zone suffers an outage in Amazon Web Services/EC2?

I think I understand how to do automatic failover in the event of an availability zone (AZ) outage, but what about automatic recovery (creating new instances in a new AZ) from an outage? Is that possible?

Example scenario:

  1. We have a three-instance cluster.
  2. An ELB round-robins traffic to the cluster.
  3. We can lose any one instance, but not two instances in the cluster, and still be fully functional.
  4. Because of (3), each instance is in a different AZ. Call them AZs A, B and C.
  5. The ELB health check is configured so that the ELB can ensure each instance is healthy.
  6. Assume that one instance is lost due to an AZ outage in AZ A.

At this point the ELB will see that the lost instance is no longer responding to health checks and will stop routing traffic to that instance. All requests will go to the two remaining healthy instances. Failover is successful.

Recovery is where I am not clear. Is there a way to automatically (i.e. no human intervention) replace the lost instance in a new AZ (e.g. AZ D)? This will avoid the AZ that had the outage (A) and not use an AZ that already has an instance in it (AZs B and C).

AutoScaling Groups?

AutoScaling Groups seem like a promising place to start, but I don't know if they can deal with this use case properly.

Questions:

  1. In an AutoScaling Group there doesn't seem to be a way to specify that the new instances that replace dead/unhealthy instances should be created in a new AZ (e.g. create it in AZ D, not in AZ A). Is this really true?
  2. In an AutoScaling Group there doesn't seem to be a way to tell the ELB to remove the failed AZ and automatically add a new AZ. Is that right?

Are these true shortcomings in AutoScaling Groups, or am I missing something?

If this can't be done with AutoScaling Groups, is there some other tool that will do this for me automatically?

In 2011, FourSquare, Reddit and others were caught out by their reliance on a single availability zone (http://www.informationweek.com/cloud-computing/infrastructure/amazon-outage-multiple-zones-a-smart-str/240009598). It seems like tools would have come a long way since then. I have been surprised by the lack of automated recovery solutions. Is each company just rolling its own solution and/or doing the recovery manually? Or maybe they're just rolling the dice and hoping it doesn't happen again?

Update:

@Steffen Opel, thanks for the detailed explanation. Auto scaling groups are looking better, but I think there is still an issue with them when used with an ELB.

Suppose I create a single auto scaling group with min, max and desired capacity all set to 3, spread across 4 AZs. Auto scaling would create one instance in each of 3 different AZs, with the 4th AZ left empty. How do I configure the ELB? If it forwards to all 4 AZs, that won't work, because one AZ will always have zero instances and the ELB will still route traffic to it. This results in HTTP 503s being returned when traffic goes to the empty AZ; I have experienced this myself in the past.

This seems to require manually updating the ELB's AZs to just those with instances running in them. This would need to happen every time auto scaling results in a different mix of AZs. Is that right, or am I missing something?
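If I understand correctly, that manual step would look something like this with the aws CLI (a sketch only; my-elb and the zone names are placeholders standing in for my setup):

    # stop the ELB from routing to the AZ that no longer has any instances
    aws elb disable-availability-zones-for-load-balancer \
        --load-balancer-name my-elb \
        --availability-zones us-east-1a

    # start routing to the AZ where the replacement instance came up
    aws elb enable-availability-zones-for-load-balancer \
        --load-balancer-name my-elb \
        --availability-zones us-east-1d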

asked May 01 '13 by xnickmx



2 Answers

Is there a way to automatically (i.e. no human intervention) replace the lost instance in a new AZ (e.g. AZ D)?

Auto Scaling is indeed the appropriate service for your use case - to answer your respective questions:

In an AutoScaling Group there doesn't seem to be a way to specify that the new instances that replace dead/unhealthy instances should be created in a new AZ (e.g. create it in AZ D, not in AZ A). Is this really true? In an AutoScaling Group there doesn't seem to be a way to tell the ELB to remove the failed AZ and automatically add a new AZ. Is that right?

You don't have to specify/tell anything of that explicitly; it's implied in how Auto Scaling works (see Auto Scaling Concepts and Terminology). You simply configure an Auto Scaling group with a) the number of instances you want to run (by defining the minimum, maximum, and desired number of running EC2 instances the group must have) and b) which AZs are appropriate targets for your instances (usually/ideally all AZs available in your account within a region).
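For illustration, a minimal sketch of that configuration with the aws CLI - the names web-lc, web-asg, my-key, my-sg and my-elb (and the AMI ID) are placeholders, not anything from your setup:

    # launch configuration: what to launch (AMI, type, key and security group are placeholders)
    aws autoscaling create-launch-configuration \
        --launch-configuration-name web-lc \
        --image-id ami-xxxxxxxx \
        --instance-type m1.medium \
        --key-name my-key \
        --security-groups my-sg

    # auto scaling group: how many instances, which AZs, and which ELB to register them with
    aws autoscaling create-auto-scaling-group \
        --auto-scaling-group-name web-asg \
        --launch-configuration-name web-lc \
        --min-size 3 --max-size 3 --desired-capacity 3 \
        --availability-zones us-east-1a us-east-1b us-east-1c \
        --load-balancer-names my-elb \
        --health-check-type ELB \
        --health-check-grace-period 300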

Auto Scaling then takes care of a) starting the requested number of instances and b) balancing these instances across the configured AZs. An AZ outage is handled automatically, see Availability Zones and Regions:

Auto Scaling lets you take advantage of the safety and reliability of geographic redundancy by spanning Auto Scaling groups across multiple Availability Zones within a region. When one Availability Zone becomes unhealthy or unavailable, Auto Scaling launches new instances in an unaffected Availability Zone. When the unhealthy Availability Zone returns to a healthy state, Auto Scaling automatically redistributes the application instances evenly across all of the designated Availability Zones. [emphasis mine]

The subsequent section Instance Distribution and Balance Across Multiple Zones explains the algorithm further:

Auto Scaling attempts to distribute instances evenly between the Availability Zones that are enabled for your Auto Scaling group. Auto Scaling does this by attempting to launch new instances in the Availability Zone with the fewest instances. If the attempt fails, however, Auto Scaling will attempt to launch in other zones until it succeeds. [emphasis mine]

Please check the linked documentation for even more details and how edge cases are handled.

Update

Regarding your follow-up question about the number of AZs being higher than the number of instances, I think you need to resort to a pragmatic approach:

You should simply select a number of AZs equal to or lower than the number of instances you want to run; in case of an AZ outage, Auto Scaling will happily balance your instances across the remaining healthy AZs, which means you'd be able to survive the outage of 2 out of 3 AZs in your example and still have all 3 instances running in the remaining AZ.
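If you ever want to narrow an existing group down to such a subset of AZs, that is a single call - a sketch, with web-asg and the zones as placeholders:

    # restrict an existing group to three specific AZs
    aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name web-asg \
        --availability-zones us-east-1a us-east-1b us-east-1d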

Please note that while it might be intriguing to use as many AZs as are available, new customers can only access three EC2 Availability Zones in US East (Northern Virginia) and two in US West (Northern California) anyway (see Global Infrastructure); i.e. only older accounts might actually have access to all 5 AZs in us-east-1, some just 4, and newer ones 3 at most.

  • I consider this to be a legacy issue, i.e. AWS is apparently rotating older AZs out of operation. For example, even if you have access to all 5 AZs in us-east-1, some instance types might not be available in all of them (e.g. the new EC2 Second Generation Standard Instances m3.xlarge and m3.2xlarge are only available in 3 out of 5 AZs in one of the accounts I'm using).

Put another way, 2-3 AZs are considered a fairly good compromise for fault tolerance within a region; if anything, cross-region fault tolerance would probably be the next thing I'd worry about.

answered Sep 29 '22 by Steffen Opel


There are many ways to solve this problem, and they depend on the particulars of what your "cluster" is and how a new node comes alive and bootstraps - maybe it registers with a master, loads data, etc. For instance, on Hadoop, a new slave node needs to be registered with the NameNode that will be serving it content. But ignoring that, let's just focus on the startup of a new node.

You can use the CLI tools for Windows or Linux instances. I fire them off from my dev box and from the servers, under both OSs. Here is the link for Linux, for example:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/setting_up_ec2_command_linux.html#set_aes_home_linux

They consist of scores of commands that you can execute from the DOS or Linux shell to do things like fire off an instance or terminate one. They require configuring environment variables such as your AWS credentials and the path to Java. Here is an example input and output for creating an instance in Availability Zone us-east-1d.

Sample command:

    ec2-request-spot-instances ami-52009e3b -p 0.02 -z us-east-1d --key DrewKP3 --group linux --instance-type m1.medium -n 1 --type one-time

Sample output:

    SPOTINSTANCEREQUEST sir-0fd0dc32 0.020000 one-time Linux/UNIX open 2013-05-01T09:22:18-0400 ami-52009e3b m1.medium DrewKP3 linux us-east-1d monitoring-disabled

Note that I am being a cheapskate and using a 2-cent Spot Instance, whereas you would be using a standard instance rather than Spot. But then again, I am creating hundreds of servers.

Alright, so you have a database. For argument's sake, let's say you have an AWS RDS MySQL micro instance running in Multi-AZ mode for an extra half a cent an hour - that is 72 cents a day. It contains a table, call it zonepref (AZ, preference), such as:

    us-west-1b,1
    us-west-1c,2
    us-west-2b,3
    us-east-1d,4
    eu-west-1b,5
    ap-southeast-1a,6

You get the idea - the preference order of zones.

There is another table in RDS, something like "active_nodes", with columns ipaddr, instanceid, zone, lastcontact, status (string, string, string, datetime, char). Let's say it contains the following active node info:

    '10.70.132.101','i-2c55bb41','us-east-1d','2013-05-01 11:18:09','A'
    '10.70.132.102','i-2c66bb42','us-west-1b','2013-05-01 11:14:34','A'
    '10.70.132.103','i-2c77bb43','us-west-2b','2013-05-01 11:17:17','A'

Here 'A' = alive and healthy, 'G' = going dead, 'D' = dead.
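A rough sketch of those two tables, assuming the mysql command-line client pointed at the RDS endpoint (the hostname, credentials and database name below are placeholders):

    mysql -h mydb.xxxxxx.us-east-1.rds.amazonaws.com -u admin -p cluster <<'SQL'
    CREATE TABLE zonepref (
      az          VARCHAR(32) PRIMARY KEY,   -- e.g. 'us-west-1b'
      preference  INT NOT NULL               -- 1 = most preferred
    );

    CREATE TABLE active_nodes (
      ipaddr      VARCHAR(32) PRIMARY KEY,   -- e.g. '10.70.132.101'
      instanceid  VARCHAR(32) NOT NULL,      -- e.g. 'i-2c55bb41'
      zone        VARCHAR(32) NOT NULL,      -- e.g. 'us-east-1d'
      lastcontact DATETIME    NOT NULL,      -- always stored in UTC
      status      CHAR(1)     NOT NULL       -- 'A', 'G' or 'D'
    );
    SQL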

Now, on startup, your node establishes either a cron job or runs a service - call it a server - in any language of your liking, like Java or Ruby. This is baked into your AMI to run at startup, and on initialization it goes out and inserts its data into the active_nodes table so its row is there. At a minimum it runs every, say, 5 minutes (depending on how mission-critical this whole thing is): the cron job runs at that interval, or the Java/Ruby server has a thread that sleeps for that amount of time. When it wakes up, it grabs its ipaddr, instanceid and AZ, and makes a call to RDS to update its row where status='A', using UTC time for lastcontact, which is consistent across time zones. If its status is not 'A', no update will occur.

In addition, it updates the status column of any other ipaddr row that has status='A', changing it to status='G' (going dead) for any other ipaddr whose now()-lastcontact is greater than, say, 6 or 7 minutes. It can also use sockets (pick a port) to contact that going-dead server and ask: hey, are you there? If so, maybe that going-dead server merely can't reach RDS (even though RDS is Multi-AZ) but can still handle other traffic. If there is no contact, change the other server's status to 'D' (dead). Refine as needed.
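A bare-bones version of that periodic job as a shell script (the RDS endpoint, credentials and database name are placeholders, and the socket check before flipping anything to 'D' is left out):

    #!/bin/bash
    # heartbeat.sh - run every 5 minutes from cron on each node
    db() { mysql -h mydb.xxxxxx.us-east-1.rds.amazonaws.com -u admin -pSECRET cluster -N -e "$1"; }

    # who am I? ask the instance metadata service
    IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
    ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)

    # refresh my own row, but only while my status is still 'A'; lastcontact is kept in UTC
    db "UPDATE active_nodes
        SET lastcontact = UTC_TIMESTAMP(), instanceid = '$ID', zone = '$AZ'
        WHERE ipaddr = '$IP' AND status = 'A';"

    # flag any other node that has not checked in for more than 6 minutes as 'G' (going dead)
    db "UPDATE active_nodes
        SET status = 'G'
        WHERE ipaddr <> '$IP' AND status = 'A'
          AND TIMESTAMPDIFF(MINUTE, lastcontact, UTC_TIMESTAMP()) > 6;"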

The concept of the 'server' that runs on each node is one with a housekeeping thread that sleeps and a main thread that blocks/listens on a port. The whole thing can be written in Ruby in 50 to 70 lines of code.

The servers can use the CLI to terminate the instance IDs of other servers, but before doing so, a server would do something like issue a select against table zonepref, ordered by preference, for the first row whose zone is not in active_nodes. It now has the next zone, and it runs ec2-run-instances with the correct AMI ID, the next zone, etc., passing along user data if necessary. You don't want both of the alive servers to create a new instance, so either wrap the create in a row lock in MySQL or push the request onto a queue or a stack so that only one of them performs it.
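A sketch of that replacement step in the same style (again, the RDS endpoint, credentials and AMI ID are placeholders; the key and group reuse the names from the spot example above, and the row lock/queue that prevents double launches is omitted):

    #!/bin/bash
    # replace_node.sh - launch a replacement in the most preferred AZ not already in use
    db() { mysql -h mydb.xxxxxx.us-east-1.rds.amazonaws.com -u admin -pSECRET cluster -N -e "$1"; }

    # most preferred zone that currently has no live node
    NEXT_AZ=$(db "SELECT az FROM zonepref
                  WHERE az NOT IN (SELECT zone FROM active_nodes WHERE status = 'A')
                  ORDER BY preference LIMIT 1;")

    # fire off the replacement there (add -d/-f user data if your bootstrap needs it)
    ec2-run-instances ami-xxxxxxxx -n 1 \
        --instance-type m1.medium \
        -k DrewKP3 -g linux \
        -z "$NEXT_AZ"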

Anyway, this might seem like overkill, but I do a lot of cluster work where nodes have to talk to one another directly. Note that I am not suggesting that just because a node seems to have lost its heartbeat, its AZ has gone down :> Maybe that instance just lost its lunch.

answered Sep 29 '22 by Drew