
RedShift Node Failover

I have a RedShift cluster of 4 nodes.

  1. When one of the nodes goes down, will the entire cluster become unavailable?
  2. If yes - for how long?
  3. When the cluster comes back, is it restored to exactly the point it was at before the failure, or may the data be rolled back to an S3 snapshot from a few hours ago?
  4. How can I simulate this failure to test the scenario myself?

Thanks a lot!

diemacht asked Dec 12 '13 09:12



2 Answers

If a single node fails, Amazon will start a new node and stream data to it from the other nodes (each block is written to two different nodes, if the cluster has more than one). In that case, we can expect:

  1. Downtime of the entire cluster until the new node has started and been filled with its share of the data. This should take about 3-4 minutes.
  2. After those 3-4 minutes the cluster returns to exactly the point it was at before it went down, and is available for both reads and writes.
  3. Some slowdown due to data redistribution within the cluster.

If more than one node fails, Redshift will restore itself from the latest S3 backup. S3 backups are taken on the following occasions:

  1. When 8 hours have passed since the last backup
  2. When more than 5 GB of data has been loaded into Redshift since the last backup
  3. Manually
  4. Optionally, as a final snapshot when you choose to terminate the cluster
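
To get a feel for how far back a multi-node failure could roll the data, here is a minimal sketch of the two automatic triggers listed above. The thresholds (8 hours, 5 GB) are taken from this answer and are not checked against current AWS behavior. The function names are made up for illustration and are not part of any AWS API.

```python
from datetime import datetime, timedelta

# Thresholds as stated in the answer above; treat them as assumptions.
SNAPSHOT_INTERVAL = timedelta(hours=8)
SNAPSHOT_DATA_THRESHOLD_GB = 5.0


def automated_snapshot_due(last_snapshot_at, now, gb_written_since_snapshot):
    """Return True if either automatic trigger from the list above has fired."""
    too_old = now - last_snapshot_at >= SNAPSHOT_INTERVAL
    too_much_data = gb_written_since_snapshot > SNAPSHOT_DATA_THRESHOLD_GB
    return too_old or too_much_data


def rollback_window(last_snapshot_at, failure_at):
    """Span of changes that would be lost when restoring from that snapshot."""
    return failure_at - last_snapshot_at
```

Under these assumptions, a cluster with a light write load could lose up to roughly 8 hours of changes in a multi-node failure, so taking a manual snapshot after an important load narrows that window.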
diemacht answered Sep 17 '22 10:09


This just happened to my cluster: one of the nodes failed. It took almost 20 minutes for the failure to show up in the dashboard (the 'Performance' tab showed the cluster as unhealthy, while the 'Status' tab still showed it as healthy).

About 1 hour after the initial failure the cluster changed its state to 'modifying', and after another hour a new node was in place.

There is a message in 'Recent Events':

A node on Amazon Redshift cluster 'xxx' was automatically replaced at 2013-12-18 11:42 UTC. The cluster is now operating normally.
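
The same message can be pulled programmatically instead of from the console. The sketch below assumes `client` behaves like `boto3.client("redshift")` and uses its `describe_events` call. The helper name and the substring it matches on are my own choices, based only on the message quoted above, and pagination is ignored.

```python
def find_node_replacement_events(client, cluster_id, lookback_minutes=1440):
    """Return (date, message) pairs for events that mention a replaced node.

    `client` is expected to behave like boto3.client("redshift").
    Only the first page of results is inspected.
    """
    response = client.describe_events(
        SourceIdentifier=cluster_id,
        SourceType="cluster",
        Duration=lookback_minutes,  # how far back to look, in minutes
    )
    return [
        (event["Date"], event["Message"])
        for event in response.get("Events", [])
        # Assumption: replacement events contain this phrase, as in the quote above.
        if "automatically replaced" in event.get("Message", "")
    ]
```

With boto3 installed and credentials configured, a call would look like `find_node_replacement_events(boto3.client("redshift"), "xxx")`, where `"xxx"` stands for the cluster identifier.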

For the whole time the cluster was unavailable: no queries could be run and no imports were possible.

The data is exactly the same as it was at the moment of the failure.
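
If you want to time an outage like this yourself, a small poller is enough. This is a rough sketch that assumes `client` behaves like `boto3.client("redshift")` and reads `ClusterStatus` from `describe_clusters`. The function name, polling interval and timeout are arbitrary choices of mine. Going by what I saw, the reported status can lag the real failure, so this measures recovery only as the API reports it.

```python
import time


def wait_until_available(client, cluster_id, poll_seconds=30,
                         timeout_seconds=3 * 3600,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until the cluster reports 'available'.

    Returns (elapsed_seconds, distinct_statuses_seen_in_order).
    `sleep` and `clock` are injectable so the loop can be tested offline.
    """
    started = clock()
    seen = []
    while True:
        response = client.describe_clusters(ClusterIdentifier=cluster_id)
        status = response["Clusters"][0]["ClusterStatus"]
        if not seen or seen[-1] != status:
            seen.append(status)  # record each status change once
        elapsed = clock() - started
        if status == "available":
            return elapsed, seen
        if elapsed >= timeout_seconds:
            raise TimeoutError(
                f"{cluster_id} still {status!r} after {elapsed:.0f}s")
        sleep(poll_seconds)
```

Start it as soon as queries begin to fail. The returned list of statuses shows whether the cluster passed through 'modifying' on its way back.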

Tomasz Tybulewicz answered Sep 18 '22 10:09