I have a Redshift cluster of 4 nodes. <ol> <li>When one of the nodes goes down, will the entire cluster become unavailable?</li> <li>If yes, for how long?</li> <li>When the cluster comes back, is it restored to exactly the point it was at before the failure, or might the data be rolled back to an S3 snapshot from a few hours earlier?</li> <li>How can I simulate this situation to test the scenario myself?</li> </ol> Thanks a lot!

If it's a single-node failure, Amazon will start a new node and stream data to it from the other nodes (in a multi-node cluster, each block is written to two different nodes). In that case, we can expect: <ol> <li>Downtime of the entire cluster until the new node has started and been filled with the database contents. This should take about 3-4 minutes.</li> <li>After these 3-4 minutes, the cluster will return to exactly the point it was at before it went down, and it will be available for both reads and writes.</li> <li>Some slowdown while data is redistributed across the cluster.</li> </ol> If more than one node fails, Redshift will restore itself from the latest S3 backup. S3 backups are taken on the following occasions: <ol> <li>When 8 hours have passed since the last backup</li> <li>When more than 5 GB of data has been loaded into Redshift since the last backup</li> <li>Manually</li> <li>Optionally, as a final snapshot when you choose to terminate your cluster</li> </ol>
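To make the backup triggers above concrete, here is a minimal sketch of the rule as this answer describes it: a snapshot is due after 8 hours or after more than 5 GB of new data, whichever comes first. This is only a model of the stated rule, not Redshift's actual implementation. The function name `automated_snapshot_due` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta

# Thresholds as stated in the answer above (assumptions, not read from any AWS API).
SNAPSHOT_INTERVAL = timedelta(hours=8)
SNAPSHOT_BYTES_THRESHOLD = 5 * 1024**3  # 5 GB


def automated_snapshot_due(last_snapshot: datetime,
                           now: datetime,
                           bytes_loaded_since: int) -> bool:
    """Return True if either trigger described in the answer has fired."""
    time_trigger = (now - last_snapshot) >= SNAPSHOT_INTERVAL
    size_trigger = bytes_loaded_since > SNAPSHOT_BYTES_THRESHOLD
    return time_trigger or size_trigger
```

Under this model, a restore after a multi-node failure could lose at most the data loaded since the last trigger fired, which is bounded by 8 hours of work or about 5 GB of loads.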

It just happened to my cluster: one of the nodes failed. It took almost 20 minutes for the failure to show up in the dashboard (the 'Performance' tab showed unhealthy, but the 'Status' tab still showed healthy). One hour after the initial failure, the cluster changed its state to 'modifying', and after another hour a new node was in place. There is a message in 'Recent Events': <blockquote> A node on Amazon Redshift cluster 'xxx' was automatically replaced at 2013-12-18 11:42 UTC. The cluster is now operating normally. </blockquote> The cluster was unavailable the whole time: no queries ran and no imports were possible. The data is exactly the same as at the moment of the failure.
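To time the state changes described above (for example 'available' to 'modifying' and back) in your own cluster, a small poller that records each transition may help. The following is a rough sketch, not an official tool. `record_status_changes` and `redshift_status_reader` are hypothetical names, and the reader assumes boto3 is installed with credentials configured. It uses boto3's `describe_clusters` call and the `ClusterStatus` field of its response.

```python
import time
from typing import Callable, List, Tuple


def record_status_changes(
    read_status: Callable[[], str],
    polls: int,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[float, str]]:
    """Poll read_status() `polls` times.

    Returns (elapsed_seconds, status) for the first reading and for every
    later reading that differs from the one before it.
    """
    start = clock()
    changes: List[Tuple[float, str]] = []
    previous = None
    for i in range(polls):
        status = read_status()
        if status != previous:
            changes.append((clock() - start, status))
            previous = status
        if i < polls - 1:
            sleep(interval_s)
    return changes


def redshift_status_reader(cluster_id: str) -> Callable[[], str]:
    """Build a status reader backed by boto3 (assumed installed and configured)."""
    import boto3  # imported lazily so the polling logic can run without AWS access

    client = boto3.client("redshift")

    def read() -> str:
        response = client.describe_clusters(ClusterIdentifier=cluster_id)
        return response["Clusters"][0]["ClusterStatus"]

    return read


# Example (requires AWS access; 'my-cluster' is a placeholder identifier):
# changes = record_status_changes(redshift_status_reader("my-cluster"), polls=240)
# for elapsed, status in changes:
#     print(f"{elapsed:8.0f}s  {status}")
```

Since the dashboard reportedly lagged the real failure by about 20 minutes, the reported status may lag as well. Running a trivial query on a schedule alongside this poller would give a more direct measure of availability.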

RedShift Node Failover

2 Answers


answered Sep 17 '22 10:09

diemacht


answered Sep 18 '22 10:09

Tomasz Tybulewicz


Tags:

amazon-web-services

amazon-redshift

failovercluster
