AWS/ELB connection draining issues

Q: What is AWS ELB connection draining?

AWS ELB connection draining prevents breaking open network connections while taking an instance out of service, updating its software, or replacing it with a fresh instance that contains updated software.

Q: What is the default setting of connection draining in ELB?

When you enable connection draining, you can specify a maximum time for the load balancer to keep connections alive before reporting the instance as de-registered. The maximum timeout value can be set between 1 and 3,600 seconds (the default is 300 seconds).
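
As a rough illustration, the setting above could be applied to a Classic Load Balancer with boto3's `elb` client and its `modify_load_balancer_attributes` call. This is a minimal sketch, not an official recipe: the helper names `connection_draining_attributes` and `apply_connection_draining` are made up for this example, and the load balancer name is whatever yours happens to be.

```python
def connection_draining_attributes(timeout_seconds=300, enabled=True):
    """Build the ConnectionDraining attribute payload for a Classic ELB.

    Hypothetical helper for illustration; the 1-3600 second range and the
    300 second default are the values quoted in the text above.
    """
    if enabled and not 1 <= timeout_seconds <= 3600:
        raise ValueError("timeout must be between 1 and 3600 seconds")
    return {"ConnectionDraining": {"Enabled": enabled, "Timeout": timeout_seconds}}


def apply_connection_draining(load_balancer_name, timeout_seconds=300):
    """Apply the setting. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the payload helper works without boto3

    elb = boto3.client("elb")  # "elb" is the Classic Load Balancer API
    return elb.modify_load_balancer_attributes(
        LoadBalancerName=load_balancer_name,
        LoadBalancerAttributes=connection_draining_attributes(timeout_seconds),
    )
```

Calling `connection_draining_attributes(60)` only builds the payload, so it can be inspected before anything is sent to AWS.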

Q: What is true about connection draining?

Connection draining ensures that existing, in-progress requests are given time to complete when an instance is deregistered from the load balancer or becomes unhealthy, while no new requests are sent to that instance.

Q: How many connections can ELB handle?

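The figures usually quoted for this are capacity-unit dimensions that AWS uses for load balancer billing rather than a hard connection limit: 60,000 active flows (or connections), sampled per minute, and 1 GB per hour for EC2 instances, containers, and IP addresses as targets.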

Tags: nginx, amazon-web-services, amazon-elb, aws-ec2

This question was asked on the AWS forums without receiving any responses. The original question follows.

Hi!

We are doing rolling upgrades of our API-instances behind an ELB and are seeing alarmingly long times when waiting for the connection draining to finish. The scenario is as follows:

We're running two identical systems, 4x c3.large behind an ELB, one system for dev and one system for production. The only difference between the two systems is that the production system continuously serves requests.

A rolling upgrade on the dev system takes about 3 minutes for all 4 instances when there is no traffic. On the production system these times fluctuate between 6 and 17+ minutes. For various reasons we need to do these rolling upgrades about twice per hour on average, so 17+ minutes per rolling upgrade is becoming a problem.

All our API calls take < 100ms, so there are no long-running requests that should hold the connection draining back for that long. We have played around with changing the values for both the idle timeout and the connection draining timeout on the ELB, with no good results.

When lowering the connection draining timeout, we see 502 responses from the API since it forcibly drops the connections, and lowering the idle timeout seems to have no effect.

All in all, we would like to know what can be done to reduce these times. As our requests all are < 100ms it should in theory not take more than a second or two to drain the connections from an instance. Is there something we are missing here?

A last note: We tried turning off connection draining altogether, and this seemed to work better than lowering the connection draining timeout. On average there were only 1 or 2 errors per test run, and some runs had no errors. Is this because the response times are so fast? Our responses are also relatively small, so is it possible that the TCP response is saved in the OS output buffer and can still be delivered even if connection draining is turned off? What is the difference between having the connection draining timeout set to 0 and having it turned off?

Additional info:

All traffic is HTTPS
SSL termination happens on the instances
keep-alive is enabled on nginx (tried to vary the value here too without any results)

Thanks!

asked Feb 26 '15 10:02

Slim

1 Answer

This is a complex question with a number of variables, so I can only offer a few suggestions to look into.

1) Check your Health Check Interval, Response Timeout, and Unhealthy Threshold settings. If, as part of your rolling upgrade, you terminate your instances while the ELB is still performing health checks, the ELB is going to wait for the duration of "Response Timeout" irrespective of connection draining. If that timeout is set to 1 minute with 3 retries ("Unhealthy Threshold"), that is 3 minutes per server before the ELB declares the instance dead. So even with connection draining set to zero, no new requests will go to that instance, but the ELB will wait 3 minutes before it decides the instance is actually dead.

Worst case: multiply by 4 instances and you're at 12 minutes before the ELB understands that all instances are dead. In other words, the ELB is busy waiting for health checks to actually fail.

2) Are you deregistering your instances from the ELB prior to terminating them? This avoids the issue in #1 above.

3) Disabling Connection Draining and enabling Connection Draining with a Timeout value of zero should be functionally equivalent.
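
To make points 1 and 2 concrete, here is a minimal Python sketch. It assumes a Classic ELB and a boto3 `elb` client passed in as `elb`. The function names and the `upgrade` callback are hypothetical placeholders, not something from the original setup. The first helper just reproduces the worst-case arithmetic from point 1. The second deregisters each instance and waits for the ELB to report it as deregistered before touching it, instead of letting health checks time out.

```python
def worst_case_wait_seconds(response_timeout, unhealthy_threshold, instances=1):
    """Estimate from point 1: timeout x retries, multiplied per instance."""
    return response_timeout * unhealthy_threshold * instances


def rolling_upgrade(elb, load_balancer_name, instance_ids, upgrade):
    """Upgrade instances one at a time, deregistering each one first.

    `elb` is expected to behave like boto3.client("elb");
    `upgrade` is a hypothetical callback that takes an instance ID.
    """
    for instance_id in instance_ids:
        target = [{"InstanceId": instance_id}]

        # Take the instance out of rotation explicitly (point 2).
        elb.deregister_instances_from_load_balancer(
            LoadBalancerName=load_balancer_name, Instances=target
        )
        # Block until the ELB reports the instance as deregistered.
        elb.get_waiter("instance_deregistered").wait(
            LoadBalancerName=load_balancer_name, Instances=target
        )

        upgrade(instance_id)

        # Put the instance back and wait until it passes health checks.
        elb.register_instances_with_load_balancer(
            LoadBalancerName=load_balancer_name, Instances=target
        )
        elb.get_waiter("instance_in_service").wait(
            LoadBalancerName=load_balancer_name, Instances=target
        )


# 1 minute timeout with 3 retries, for one instance and for four.
print(worst_case_wait_seconds(60, 3))      # 180 seconds = 3 minutes
print(worst_case_wait_seconds(60, 3, 4))   # 720 seconds = 12 minutes
```

With real credentials this would be called with something like `rolling_upgrade(boto3.client("elb"), "my-load-balancer", ids, my_upgrade)`, where the load balancer name, the ID list, and the callback are all placeholders for your own values.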

answered Sep 28 '22 06:09

GMan
