
AWS/ELB connection draining issues

This question was asked on the AWS forums without receiving any responses. The original question is below:


Hi!

We are doing rolling upgrades of our API instances behind an ELB and are seeing alarmingly long waits for connection draining to finish. The scenario is as follows:

We're running two identical systems, 4x c3.large behind an ELB, one system for dev and one system for production. The only difference between the two systems is that the production system continuously serves requests.

A rolling upgrade on the dev system takes about 3 minutes for all 4 instances when there is no traffic. On the production system these times fluctuate between 6 and 17+ minutes. For various reasons we need to do these rolling upgrades about twice per hour on average, so 17+ minutes per rolling upgrade is becoming a problem.

All our API calls are < 100ms, so there are no long-running requests that should hold connection draining back for that long. We have played around with changing the values for both the idle timeout and the connection draining timeout on the ELB, with no good results.

When lowering the connection draining timeout we see 502 responses from the API, since the ELB forcibly drops the connections, and lowering the idle timeout seems to have no effect.

All in all, we would like to know what can be done to reduce these times. As our requests all are < 100ms it should in theory not take more than a second or two to drain the connections from an instance. Is there something we are missing here?

A last note: We tried turning off connection draining altogether, and this seemed to work better than lowering the connection draining timeout. On average there were only 1 or 2 errors per test run, and some runs had no errors. Is this because the response times are so fast? Our responses are also relatively small, so is it possible that the TCP response is held in the OS output buffer and can still be delivered even if connection draining is turned off? What is the difference between having the connection draining timeout set to 0 and having connection draining turned off?

Additional info:

  • All traffic is HTTPS
  • SSL termination happens on the instances
  • keep-alive is enabled on nginx (tried to vary the value here too without any results)

Thanks!

Slim asked Feb 26 '15 10:02



1 Answer

This is a complex question with a number of variables, so I can only offer a few suggestions to look into.

1) Check your Health Check Interval, Response Timeout, and Unhealthy Threshold settings. If, as part of your rolling upgrade, you terminate your instances while the ELB is still performing health checks, the ELB will wait for the duration of the "Response Timeout" irrespective of connection draining. If that timeout is set to 1 minute with 3 retries ("Unhealthy Threshold"), that is 3 minutes per server before the ELB declares the instance dead. So even with connection draining set to zero, no new requests will go to that instance, but the ELB will wait 3 minutes before it decides the instance is actually dead.

Worst case: multiply by 4 instances and you're at 12 minutes before the ELB understands that all instances are dead. In other words, the ELB is busy waiting for health checks to actually fail.

2) Are you deregistering your instances from the ELB before terminating them? Doing so avoids the issue in #1 above.

3) Disabling Connection Draining and enabling Connection Draining with a Timeout value of zero should provide equivalent functionality.
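
Points 1 and 2 can be sketched in Python as follows. This is only an illustration under assumptions, not something from the original setup: `worst_case_detection_seconds` reproduces the timeout × threshold × instances arithmetic from point 1, and `deregister_and_wait` assumes `elb` is a boto3 Classic ELB client (for example `boto3.client("elb")`) and that a deregistered instance is eventually reported as `OutOfService`. The function names, the polling interval, and the load balancer and instance IDs are hypothetical.

```python
import time


def worst_case_detection_seconds(response_timeout_s, unhealthy_threshold, instances=1):
    """Time the ELB may spend waiting for health checks to fail (point 1).

    Follows the answer's arithmetic: timeout x threshold per instance,
    multiplied by the number of instances upgraded one after another.
    """
    return response_timeout_s * unhealthy_threshold * instances


def deregister_and_wait(elb, lb_name, instance_id, poll_s=2.0, max_wait_s=300.0):
    """Deregister one instance, then poll until it is OutOfService (point 2).

    `elb` is assumed to be a boto3 Classic ELB client.
    Returns True once the instance is out of service, False on timeout.
    """
    instances = [{"InstanceId": instance_id}]
    elb.deregister_instances_from_load_balancer(
        LoadBalancerName=lb_name, Instances=instances
    )
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        health = elb.describe_instance_health(
            LoadBalancerName=lb_name, Instances=instances
        )
        states = health["InstanceStates"]
        if states and all(s["State"] == "OutOfService" for s in states):
            return True  # safe to stop or upgrade the instance now
        time.sleep(poll_s)
    return False  # still draining after max_wait_s; caller decides what to do
```

With the numbers from point 1, `worst_case_detection_seconds(60, 3, instances=4)` gives 720 seconds, i.e. the 12 minutes mentioned above. In a rolling upgrade, one would call something like `deregister_and_wait(boto3.client("elb"), "my-api-elb", "i-0123456789abcdef0")` (placeholder names) before stopping each instance, and register it again once the upgrade is done.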

GMan answered Sep 28 '22 06:09