Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

NLB Target Group health checks are out of control

I have a Network Load Balancer and an associated Target Group that is configured to do health checks on the EC2 instances. The problem is that I am seeing a very high number of health check requests; multiple every second.

The default interval between checks is supposed to be 30 seconds, but they are coming about 100x more frequently than they should.

My stack is built in CloudFormation, and I've tried overriding HealthCheckIntervalSeconds, which has no effect. Interestingly, when I tried to manually change the interval in the console, I found those values greyed out:

Edit Healthcheck Settings

Here is the relevant part of the template, with my attempt at changing the interval commented out:

NLB:   Type: "AWS::ElasticLoadBalancingV2::LoadBalancer"   Properties:     Type: network     Name: api-load-balancer     Scheme: internal     Subnets:        - Fn::ImportValue: PrivateSubnetA       - Fn::ImportValue: PrivateSubnetB       - Fn::ImportValue: PrivateSubnetC  NLBListener:   Type : AWS::ElasticLoadBalancingV2::Listener   Properties:     DefaultActions:       - Type: forward         TargetGroupArn: !Ref NLBTargetGroup     LoadBalancerArn: !Ref NLB     Port: 80     Protocol: TCP  NLBTargetGroup:   Type: AWS::ElasticLoadBalancingV2::TargetGroup   Properties:     # HealthCheckIntervalSeconds: 30     HealthCheckPath: /healthcheck     HealthCheckProtocol: HTTP     # HealthyThresholdCount: 2     # UnhealthyThresholdCount: 5     # Matcher:     #   HttpCode: 200-399     Name: api-nlb-http-target-group     Port: 80     Protocol: TCP      VpcId: !ImportValue PublicVPC 

My EC2 instances are in private subnets with no access from the outside world. The NLB is internal, so there's no way of accessing them without going through API Gateway. API Gateway has no /healthcheck endpoint configured, so that rules out any activity coming from outside of the AWS network, like people manually pinging the endpoint.

This is a sample of my app's log taken from CloudWatch, while the app should be idle:

07:45:33 {"label":"Received request URL","value":"/healthcheck","type":"trace"} 07:45:33 {"label":"Received request URL","value":"/healthcheck","type":"trace"} 07:45:33 {"label":"Received request URL","value":"/healthcheck","type":"trace"} 07:45:33 {"label":"Received request URL","value":"/healthcheck","type":"trace"} 07:45:34 {"label":"Received request URL","value":"/healthcheck","type":"trace"} 07:45:34 {"label":"Received request URL","value":"/healthcheck","type":"trace"} 07:45:34 {"label":"Received request URL","value":"/healthcheck","type":"trace"} 07:45:35 {"label":"Received request URL","value":"/healthcheck","type":"trace"} 07:45:35 {"label":"Received request URL","value":"/healthcheck","type":"trace"} 07:45:35 {"label":"Received request URL","value":"/healthcheck","type":"trace"} 

I'm getting usually 3 to 6 requests every second, so I'm wondering if this is just the way the Network Load Balancers work, and AWS still haven't documented that (or I haven't found it), or otherwise how I might fix this issue.

like image 606
Miles Avatar asked Jan 07 '18 08:01

Miles


2 Answers

Update: this has been answered on the related aws forum post which confirms that it's normal behaviour for network load balancers and cites their distributed nature as the reason. There is no way to configure a custom interval. At this moment, the docs are still out of date and specify otherwise.


This is either a bug in NLB Target Groups, or normal behaviour with incorrect documentation. I've come to this conclusion because:

  • I verified that the health checks are coming from the NLB
  • The configuration options are greyed out on the console
    • inferring that AWS know about or imposed this limitation
  • The same results are being observed by others
  • The documentation is specifically for Network Load Balancers
  • AWS docs commonly lead you on a wild goose chase

In this case I think it might be normal behaviour that's been documented incorrectly, but there's no way of verifying that unless someone from AWS can, and it's almost impossible to get an answer for an issue like this on the aws forum.

It would be useful to be able to configure the setting, or at least have the docs updated.

like image 52
Miles Avatar answered Sep 16 '22 13:09

Miles


Edit: Just wanted to share an update on this now in Sept 2021. If you are using an NLB you may get an email similar to this:

We are contacting you regarding an upcoming change to your Network Load Balancer(s). Starting on September 9, 2021, we will upgrade NLB's target health checking system. The upgraded system offers faster failure identification, improves target health state accuracy, and allows ELB to weight out of impacted Availability Zones during partial failure scenarios.

As a part of this update, you may notice that there is less health check traffic to backend targets, reducing the targets NetworkIn/Out metrics, as we have removed redundant health checks.

I expect this should resolve the issues with targets receiving many healthchecks when using NLB.

Previous answer:

AWS employee here. To elaborate a bit on the accepted answer, the reason you may see bursts of health check requests is that NLB uses multiple distributed health checkers to evaluate target health. Each of these health checkers will make a request the target at the interval you specify, but all of them are going to make a request to it at that interval, so you will see one request from each of the distributed probes. The target health is then evaluated based on how many of the probes were successful.

You can read a very detailed explanation written here by another AWS employee, under "A look at Route 53 health checks": https://medium.com/@adhorn/patterns-for-resilient-architecture-part-3-16e8601c488e

My recommendation for healthchecks is to code healthchecks to be very light. A lot of people make the mistake of overloading their healthcheck to also do things like check the backend database, or run other checks. Ideally a healthcheck for your load balancer is doing nothing but returning a short string like "OK". In this case it should take less than a millisecond for your code to serve the healthcheck request. If you follow this pattern then occasional bursts of 6-8 healthcheck requests should not overload your process.

like image 45
nathanpeck Avatar answered Sep 17 '22 13:09

nathanpeck