 

Multi-AZ RDS test failover and connection monitoring

My question has two parts:

  1. What is the best way to initiate an RDS failover for testing purposes?
  2. How can I monitor the connection during failover in order to observe the time that it takes for AWS to reconnect the user to the standby instance?

With respect to part (1): If I understand correctly, all instance modifications are made on the standby and then AWS fails over by flipping the CNAME over to the standby as the primary is updated, so if I were to make any kind of instance modification and select "apply immediately," it should cause a failover, correct?

With respect to part (2): I am looking specifically for a way to monitor the failover of an Oracle RDS instance, whether through a Lambda function, a bash script, or some other means. As far as I can tell, it is not possible to ping an RDS instance, even when I allow all ICMP traffic via the security group. I can connect without trouble using telnet or an SQL client. What I would like is a way to probe the database periodically during a failover, to see when the IP associated with the connection string switches over and how long that takes. Any suggestions?

amparito asked Mar 08 '17 16:03


People also ask

How do I monitor a failover in RDS?

To see if a failover has occurred, open the Amazon RDS console, and then choose Events from the navigation pane. If AWS CloudTrail logging is enabled, then you can check the logs to see whether the event was planned or unplanned. For example, scaling compute or applying pending OS upgrades can trigger a failover.

What happens during RDS Multi-AZ failover?

In an Amazon RDS Multi-AZ deployment, Amazon RDS automatically creates a primary database (DB) instance and synchronously replicates the data to an instance in a different AZ. When it detects a failure, Amazon RDS automatically fails over to a standby instance without manual intervention.

How does multi-AZ failover work?

According to the SLAs provided by AWS, whenever an instance marked as Multi-AZ experiences a failure (network failure, disk failure, etc.), AWS automatically shifts traffic to its standby running in a separate AZ in the same AWS region.


2 Answers

  1. Correct. RDS applies your modifications to the standby instance and then fails over to it. Per their documentation:

The availability benefits of Multi-AZ deployments also extend to planned maintenance and backups. In the case of system upgrades like OS patching or DB Instance scaling, these operations are applied first on the standby, prior to the automatic failover. As a result, your availability impact is, again, only the time required for automatic failover to complete.

To simulate a failover, select the "Reboot with failover" option when rebooting the instance, instead of performing a plain reboot. From the linked documentation:

Reboot with failover is beneficial when you want to simulate a failure of a DB instance for testing, or restore operations to the original AZ after a failover occurs.
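The same reboot with failover can also be requested from the AWS CLI rather than the console. The following is a minimal sketch, assuming the AWS CLI is installed and configured with permission to reboot the instance. `my-oracle-db` is a placeholder identifier, and `failover_cmd` is a hypothetical helper (not part of any AWS tooling) that only prints the command so it can be reviewed before it is run.

```shell
# Hypothetical helper: print the AWS CLI command that reboots a Multi-AZ
# DB instance with a forced failover. Nothing is executed against AWS here.
failover_cmd() {
    # $1 = DB instance identifier (placeholder; substitute your own)
    printf 'aws rds reboot-db-instance --db-instance-identifier %s --force-failover\n' "$1"
}

# Print the command for review.
failover_cmd my-oracle-db

# Once it looks right, run it for real (left commented out on purpose):
# eval "$(failover_cmd my-oracle-db)"
```

Printing the command first is only a safety habit for test runs; calling `aws rds reboot-db-instance` directly with `--force-failover` works the same way.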

  2. Write a script that, at a regular interval, connects with a SQL client and runs a quick SELECT against a table of your choice. You can use this to measure true downtime during the failover. We have a very similar tool that we use to estimate the impact of modifications on a test RDS instance before we apply them to our production RDS instance. Our tool simply writes a timestamp and a success/failure result to the console every few seconds. It reports success before the reboot, failure during it, and success again after the cutover completes.
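A polling tool of this kind could be sketched roughly as follows. This is not the tool described above, only an illustration of the idea. `probe_loop` is a hypothetical function name, and the check itself is passed in as a command, so any SQL client invocation whose exit status reflects success or failure could be substituted.

```shell
# probe_loop COUNT INTERVAL CMD [ARGS...]
# Runs CMD once every INTERVAL seconds, COUNT times in total, and prints
# one line per attempt: a UTC timestamp followed by OK or FAIL, depending
# on the exit status of CMD.
probe_loop() {
    count=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        if "$@" >/dev/null 2>&1; then
            status=OK
        else
            status=FAIL
        fi
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$status"
        i=$((i + 1))
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example (commented out): probe a TCP port 600 times, one second apart.
# The -w 2 option caps each netcat attempt at two seconds.
# probe_loop 600 1 nc -z -w 2 DBNAME.REGION.rds.amazonaws.com PORT
```

A TCP check only shows that the port accepts connections. Running an actual query through a SQL client, as described above, is the stricter test of whether the database is usable.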

Additional Resources:

  • Modifying an Amazon RDS DB Instance and Using the Apply Immediately Parameter
  • Modifying a DB Instance Running the Oracle Database Engine

Anthony Neace answered Oct 18 '22 02:10


Update on this:

I ended up using a simple bash script:

while true; do date; nc -vz DBNAME.REGION.rds.amazonaws.com PORT; sleep 1; done

Note: the above is for netcat-openbsd. If using netcat-traditional, you'll need to modify this.

This polls the database every second to check whether it is still possible to connect. Typically, when I ran this and then initiated a reboot with failover, the connection would simply hang during the failover, then display a timeout error once the failover was complete and connectivity resumed, presumably because the failover usually takes longer than the reboot. If the reboot happens to take longer than the failover, though, there may be a period during which the connection is refused as the reboot completes. In any case, using this method, I measured a consistent failover time of 2:08.
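If the probe results are saved to a log, the outage length can be computed instead of read off by eye. A minimal sketch, assuming a hypothetical log format with one line per probe, `EPOCH STATUS`, where `EPOCH` is seconds since the Unix epoch (for example from `date +%s`) and `STATUS` is `OK` or `FAIL`. The name `outage_seconds` and the sample numbers are made up for illustration.

```shell
# outage_seconds: read "EPOCH STATUS" lines on stdin and print the number
# of seconds between the first FAIL and the first OK that follows it.
outage_seconds() {
    awk '
        # Remember the timestamp of the first failed probe.
        $2 == "FAIL" && start == "" { start = $1 }
        # First success after a failure: report the gap and stop reading.
        $2 == "OK" && start != "" { print $1 - start; found = 1; exit }
        END { if (!found) print "no completed outage in log" }
    '
}

# Made-up sample log: first failure at t=101, recovery at t=229.
printf '%s\n' '100 OK' '101 FAIL' '102 FAIL' '229 OK' | outage_seconds
# prints 128
```

The result is only as precise as the polling interval, and a probe that hangs rather than failing quickly stretches the gap between log lines, so a short connection timeout helps keep the timestamps meaningful.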

It seems, however, that contrary to what I originally thought, most instance modifications do not involve a failover at all. I tested resizing the instance as well as changing the option groups and parameter groups, and did not experience any downtime.

Changing the database engine does result in a failover.

amparito answered Oct 18 '22 03:10