 

Health probe marks instances as unhealthy but they aren't

I use a VM scale set for my Node application. My app has a publicly accessible action at www.mydomain.com/api/healthcheck that just prints some JSON. When I configure my health probe to use the TCP protocol, everything works fine and my API returns the expected JSON (and status 200). However, when I switch my health probe to HTTP with path=/api/healthcheck, my website isn't accessible anymore (ERR_CONNECTION_TIMED_OUT). I guess the load balancer takes all instances out of rotation because the health probe reports every instance as unhealthy.

I use nginx in front of my Node app, but for testing I also tried configuring my load balancer to route port 80 to backend port 8080 (where my Node app runs on every machine, so I can bypass the nginx proxy). I get the same behaviour.

I'm out of ideas why my custom health check doesn't work. Hope you can help.


Edit: For testing, I did the following:

  • run another nodejs app on port 3000 on every VM, which just prints "hello world" (without nginx proxy!)
  • create a LB rule for port 3000 and also configure my NSG to allow :3000 for all
  • at the beginning, my health probe is configured to use tcp
  • result: mydomain.com:3000/hello is available (prints hello and returns 200)
  • now I configure my health probe to use http-protocol, port 3000 and location /hello.
  • result: my whole web app isn't available anymore
Munchkin asked Aug 08 '17 13:08



1 Answer

I can't see your server's code, so it's hard to pin down the cause. If you shared some code it would be easier.

So let's try to analyze the situation:

Initial Check

Connection to the instances has timed out

Try running the following command from your machine's terminal:

curl -I private-IP-address-of-the-instance:port/health-check-target-page

Now, depending on the outcome, there are different possible causes.

Initial Check Result : non-200 response

  • No target page is configured on the instance.
  • The value of the Content-Length header in the response is not set.
  • The application is not configured to receive requests from the load balancer or to return a 200 response code.
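To rule out the causes above, it can help to test against a stripped-down endpoint. The following is a minimal sketch only, not the asker's actual code: it assumes a plain Node `http` server, reuses the /api/healthcheck path and backend port 8080 from the question, and the names `handleRequest` and `createHealthServer` are made up for illustration. It answers the probe path with an explicit 200 and a Content-Length header.

```javascript
const http = require('http');

// Path taken from the question; adjust to whatever the probe is configured to request.
const HEALTH_PATH = '/api/healthcheck';

// Hypothetical handler: replies 200 with an explicit Content-Length on the
// probe path, and 404 on everything else.
function handleRequest(req, res) {
  const path = req.url.split('?')[0];
  // curl -I sends HEAD, a load balancer probe typically sends GET; accept both.
  if ((req.method === 'GET' || req.method === 'HEAD') && path === HEALTH_PATH) {
    const body = JSON.stringify({ status: 'ok' });
    res.writeHead(200, {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(body),
    });
    res.end(body); // Node omits the body automatically for HEAD requests
    return;
  }
  res.writeHead(404, { 'Content-Length': 0 });
  res.end();
}

// Builds the server without starting it, so the caller picks port and interface.
function createHealthServer() {
  return http.createServer(handleRequest);
}

// To run it on the backend port from the question (assumption: 8080, all interfaces):
// createHealthServer().listen(8080, '0.0.0.0');
```

If the probe turns healthy against this server but not against the real app, the difference most likely lies in the app's routing, redirects, or response headers rather than in the load balancer.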

Initial Check Result : able to connect directly to the instance

  • The instance is failing to respond within the configured response timeout period.
  • The instance is under significant load and is taking longer than your configured response timeout period to respond.
  • If you are using an HTTP or an HTTPS connection and the health check is being performed on a target page specified in the ping path field (for example, HTTP:80/index.html), the target page might be taking longer to respond than your configured timeout.
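To check whether slow responses are the problem, you could time the health-check request yourself from a machine that can reach the instance. This is a rough sketch under assumptions: the function names `probe` and `isHealthy`, the example IP, and the 5000 ms timeout are all hypothetical, and the rule "healthy means 200 within the timeout" is a simplification of what a load balancer does.

```javascript
const http = require('http');

// Simplified health rule: a 200 response that arrived within the timeout.
function isHealthy(statusCode, elapsedMs, timeoutMs) {
  return statusCode === 200 && elapsedMs <= timeoutMs;
}

// Requests the URL once and reports status code and time until the response
// headers arrived. Always resolves; failures are reported as unhealthy.
function probe(url, timeoutMs) {
  return new Promise((resolve) => {
    const started = Date.now();
    // Note: the `timeout` option is a socket inactivity timeout, not a hard
    // limit on total request time.
    const req = http.get(url, { timeout: timeoutMs }, (res) => {
      res.resume(); // discard the body, only status and timing matter here
      const elapsedMs = Date.now() - started;
      resolve({
        statusCode: res.statusCode,
        elapsedMs,
        healthy: isHealthy(res.statusCode, elapsedMs, timeoutMs),
      });
    });
    // On inactivity, abort the request; this triggers the 'error' handler below.
    req.on('timeout', () => req.destroy(new Error('timed out')));
    req.on('error', (err) => {
      resolve({
        statusCode: null,
        elapsedMs: Date.now() - started,
        healthy: false,
        error: err.message,
      });
    });
  });
}

// Example call (IP, port and timeout are placeholders):
// probe('http://10.0.0.4:8080/api/healthcheck', 5000).then(console.log);
```

A result with a non-200 status points back to the first list of causes; a timeout or a large `elapsedMs` points to the ones in this list.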

Other : Instance is not receiving traffic from the load balancer

Problem: The security group for the instance (the NSG, in an Azure setup like the one in the question) is blocking the traffic from the load balancer.

Do a packet capture on the instance to verify the issue. Use the following command:

tcpdump port health-check-port
EMX answered Oct 17 '22 20:10
