Generally when using AWS RDS, the recommended practice to achieve high availability is to deploy hot replica in different AZ (multi AZ deployment). Also, some read replicas can be brought up to improve read performance.
I've read AWS Aurora documentation, it uses common virtual storage layer, which is replicated on 3 AZ, with two copies in each AZ.
My question is this: Is there any need to use Amazon multi AZ deployment of Aurora DB cluster, if Aurora itself is capable of healing itself, and has its storage distributed over multi AZs? If it keeps 2 storage copies in each of 3 AZs, then its as reliable as using the multi AZ replica setup for failover. Also, during failover. it automatically creates another instance (if no read replica exist) or switches the primary. I really do not understand any need to create additional requirement of using multi AZ aurora cluster to 'improve' availability.
Is it possible that there's some scenario where availibility would suffer under default Aurora deployment? What happens during loss of an entire AZ which contains the primary Aurora DB node?
Multi-AZ deployments gives high availability and automatic failover. Amazon RDS creates a storage-level replica of the database in a second Availability Zone. It then synchronously replicates data from the primary to the standby DB instance for high availability.
Amazon RDS Multi-AZ deployments provide enhanced availability and durability for RDS database (DB) instances, making them a natural fit for production database workloads.
An Aurora DB cluster is fault tolerant by design. The cluster volume spans multiple Availability Zones (AZs) in a single AWS Region, and each Availability Zone contains a copy of the cluster volume data.
The primary host of the RDSRDSAmazon Relational Database Service (Amazon RDS) is a collection of managed services that makes it simple to set up, operate, and scale databases in the cloud.https://aws.amazon.com › rdsFully Managed Relational Database - Amazon RDS - Amazon Web Services Multi-AZ instance is unreachable due to loss of network connectivity: This reason indicates that the Multi-AZ failover and database instance restart were caused by a transient network issue that affected the primary host of your Multi-AZ deployment.
If you are only interested in your data not being lost, then a non-multi AZ would probably work fine because, as you said, the data is replicated for you.
But the running instance of Aurora still lives on a physical machine, and that physical machine lives in a single AZ, so if that AZ goes down, while you may not lose any data you won't necessarily have access to it.
A multi-AZ deployment has a physical machine running in more than one AZ, so if one AZ goes down, the database server in the other AZ can still serve your requests.
The RDS Multi-AZ feature is much simpler for Aurora deployments than it is for non-Aurora deployments: An Aurora Replica is a Multi-AZ failover target in addition to a read-scaling endpoint, so creating a Multi-AZ Aurora deployment is as simple as deploying an Aurora Replica in a different Availability Zone from the primary instance.
This behavior is different from standard non-Aurora Multi-AZ deployments, which maintain a separate synchronously-replicated 'standby instance' which cannot be used as a read-scaling endpoint, and vice versa (standard RDS Read Replicas cannot be used as Multi-AZ failover targets).
Even though Aurora data is backed up across AZs, having a replica instance already running can still significantly reduce the amount of time it takes to recover from a failure of the primary instance. The typical amount of time Aurora takes to recover from a failover with an Aurora Replica available is 1-2 minutes, compared to 10 minutes without a Replica, as described in Fault Tolerance for an Aurora DB Cluster:
If the primary instance in a DB cluster fails, Aurora automatically fails over to a new primary instance in one of two ways:
- By promoting an existing Aurora Replica to the new primary instance
- By creating a new primary instance
If the DB cluster has one or more Aurora Replicas, then an Aurora Replica is promoted to the primary instance during a failure event. [...] However, service is typically restored in less than 120 seconds, and often less than 60 seconds. [...]
If the DB cluster doesn't contain any Aurora Replicas, then the primary instance is recreated during a failure event. [...] Service is restored when the new primary instance is created, which typically takes less than 10 minutes.
Promoting an Aurora Replica to the primary instance is much faster than creating a new primary instance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With