I'm considering using an Amazon RDS Multi-AZ deployment for a service in a production environment, and I've read the related documentation.
However, I have a question about the failover. In the FAQ of Amazon RDS, failover is described as follows:
Q: What happens during Multi-AZ failover and how long does it take?
Failover is automatically handled by Amazon RDS so that you can resume database operations as quickly as possible without administrative intervention. When failing over, Amazon RDS simply flips the canonical name record (CNAME) for your DB Instance to point at the standby, which is in turn promoted to become the new primary. We encourage you to follow best practices and implement database connection retry at the application layer. Failover times are a function of the time it takes crash recovery to complete. Start-to-finish, failover typically completes within three minutes.
From the above description, I guess there must be a monitoring service that detects failure of the primary instance and does the flipping.
My question is, which AZ is this monitoring service hosted in? There are 3 possibilities:
1. Same AZ as the primary
2. Same AZ as the standby
3. Another AZ
Apparently 1 and 2 can't be the case, since neither could handle an entire AZ becoming unavailable. So, if 3 is the case, what if the AZ of the monitoring service goes down? Is there another service to monitor this monitoring service? It seems like an endless chain of dominoes.
So, how is Amazon ensuring the availability of RDS in Multi-AZ deployment?
How it works. In an Amazon RDS Multi-AZ deployment, Amazon RDS automatically creates a primary database (DB) instance and synchronously replicates the data to an instance in a different AZ. When it detects a failure, Amazon RDS automatically fails over to a standby instance without manual intervention.
If a storage volume on your primary instance fails in a Multi-AZ deployment, Amazon RDS automatically initiates a failover to the up-to-date standby (or to a replica in the case of Amazon Aurora).
When you change your Single-AZ instance to Multi-AZ, you don't experience any downtime on the instance. During the modification, Amazon RDS creates a snapshot of the instance's volumes. Then, this snapshot is used to create new volumes in another Availability Zone.
What would happen to an RDS (Relational Database Service) Multi-Availability Zone deployment if the primary DB instance fails? Per the FAQ quoted above, the CNAME record for the DB instance is pointed at the standby, which is promoted to become the new primary. The FAQ does not describe the primary's IP address being moved to the standby.
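Because the FAQ describes failover as a CNAME flip, a client that caches the resolved address may keep talking to the old primary after a failover. Below is a minimal Python sketch of re-resolving the endpoint and noticing that it moved. The helper names `resolve_ips` and `endpoint_moved`, the default port 3306, and the injectable `resolver` parameter are assumptions made for illustration; they are not part of any RDS API.

```python
import socket


def resolve_ips(hostname, port=3306, resolver=socket.getaddrinfo):
    """Return the set of IP addresses the hostname currently resolves to.

    `resolver` defaults to socket.getaddrinfo; it is injectable only so the
    function can be exercised without real DNS (an assumption of this sketch).
    """
    infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def endpoint_moved(hostname, previous_ips, **kwargs):
    """True if the hostname no longer resolves to any previously seen IP."""
    return resolve_ips(hostname, **kwargs).isdisjoint(previous_ips)
```

In practice this mostly means connecting by the endpoint hostname on every new connection and making sure the runtime's DNS cache does not pin the old address for longer than the record's TTL.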
So, how is Amazon ensuring the availability of RDS in Multi-AZ deployment?
I think the "how" in this case is abstracted away from the user by design, given that RDS is a PaaS service. A great deal of a multi-AZ deployment is hidden; however, the following are true:
In his blog post, John Gemignani mentions the notion of an observer managing which RDS instance is active in the multi-AZ architecture. But to your point, what is the observer? And where is it observing from?
Here's my guess, based upon my experience with AWS:
The observer in an RDS multi-AZ deployment is a highly available service that is deployed throughout every AZ in every region where RDS multi-AZ is available, and makes use of existing AWS platform services to monitor the health and state of all of the infrastructure that may affect an RDS instance. Some of the services that make up the observer may be part of the AWS platform itself, and otherwise hidden from the user.
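To illustrate why an observer spread across AZs would avoid the "who monitors the monitor" regress raised in the question, here is a toy majority-vote model in Python. It is purely an assumption for illustration: AWS does not publish how RDS decides to fail over, and the names `HealthReport` and `should_fail_over` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthReport:
    observer_az: str          # AZ the hypothetical observer runs in
    primary_reachable: bool   # what that observer saw when probing the primary


def should_fail_over(reports, total_observers):
    """Fail over only if a strict majority of ALL observers report the primary down.

    Observers that are themselves unavailable simply send no report, so a
    missing report never counts as a vote for failover.
    """
    down_votes = sum(1 for r in reports if not r.primary_reachable)
    return down_votes > total_observers // 2
```

With one observer in each of three AZs, losing the primary's AZ leaves two observers that both report the primary unreachable, which is a majority. Losing an AZ that hosts only an observer leaves two healthy reports and no failover. No single AZ is special, so nothing needs to watch the watcher.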
I would be willing to bet that the same underlying services that comprise CloudWatch Events are used in some capacity for the RDS multi-AZ observer. In his blog post announcing CloudWatch Events, Jeff Barr describes the service this way:
You can think of CloudWatch Events as the central nervous system for your AWS environment. It is wired in to every nook and cranny of the supported services, and becomes aware of operational changes as they happen. Then, driven by your rules, it activates functions and sends messages (activating muscles, if you will) to respond to the environment, making changes, capturing state information, or taking corrective action.
Think of the observer the same way - it's a component of the AWS platform that provides a function that we, as the users of the platform, do not need to think about. It's part of AWS's responsibility in the Shared Responsibility Model.
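What does remain on the user's side is the connection retry that the FAQ quoted in the question recommends. A minimal Python sketch, assuming a caller-supplied `connect` callable that opens a connection to the RDS endpoint and raises on failure, could look like this. The function name `connect_with_retry` and all of its parameters are hypothetical, and `retry_on` would need to be set to whatever exception type your database driver actually raises.

```python
import random
import time


def connect_with_retry(connect, attempts=6, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep, retry_on=(OSError,)):
    """Call connect() until it succeeds, backing off exponentially between tries.

    ConnectionError is a subclass of OSError, so the default covers plain
    socket-level failures; pass your driver's error type for real use.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except retry_on as exc:
            last_error = exc
            if attempt == attempts - 1:
                break  # out of attempts, do not sleep again
            # Exponential backoff capped at max_delay, with full jitter so
            # many clients do not reconnect in lockstep after a failover.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

The defaults here are arbitrary. Since the FAQ says failover typically completes within three minutes, `attempts` and `max_delay` should be tuned so the total wait comfortably covers that window.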