 

How to troubleshoot why the Endpoints in my service don't get updated?

I have a Kubernetes cluster running on the Google Kubernetes Engine.

I have a deployment that I manually scaled up (by editing the HPA object) from 100 replicas to 300 replicas to do some load testing. When I was load testing the deployment by sending HTTP requests to the service, it seemed that not all pods were receiving an equal amount of traffic: only around 100 pods appeared to be processing requests (judging by their CPU load and our custom metrics). So my suspicion was that the service was not load balancing the requests among all the pods equally.

When I checked the deployment, I could see that all 300 replicas were ready.

$ k get deploy my-app --show-labels
NAME                DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE       LABELS
my-app              300       300       300          300         21d       app=my-app

On the other hand, when I checked the service, I saw this:

$ k describe svc my-app
Name:              my-app
Namespace:         production
Labels:            app=my-app
Selector:          app=my-app
Type:              ClusterIP
IP:                10.40.9.201
Port:              http  80/TCP
TargetPort:        http/TCP
Endpoints:         10.36.0.5:80,10.36.1.5:80,10.36.100.5:80 + 114 more...
Port:              https  443/TCP
TargetPort:        https/TCP
Endpoints:         10.36.0.5:443,10.36.1.5:443,10.36.100.5:443 + 114 more...
Session Affinity:  None
Events:            <none>

What was strange to me is this part

Endpoints:         10.36.0.5:80,10.36.1.5:80,10.36.100.5:80 + 114 more...

I was expecting to see 300 endpoints there. Is that assumption correct?
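A quick sanity check for this (assuming the app=my-app selector and the k alias for kubectl used above) is to compare the number of running pods with the number of addresses actually present in the Endpoints object:

# running pods matching the service selector
$ k get pods -l app=my-app --field-selector=status.phase=Running --no-headers | wc -l

# pod IPs currently listed as ready endpoints
$ k get endpoints my-app -o jsonpath='{.subsets[*].addresses[*].ip}' | wc -w

If the second number is lower, the Endpoints object really is lagging behind the deployment, and it's not just kubectl describe truncating the output.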

(I also found this post about a similar issue, but there the author was experiencing only a few minutes' delay until the endpoints were updated, whereas for me nothing changed even after half an hour.)

How could I troubleshoot what was going wrong? I read that the endpoints are managed by the Endpoints controller, but I couldn't find any info about where to check its logs.
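From what I've read, the Endpoints controller runs inside kube-controller-manager, and since GKE manages the control plane I don't think its logs are directly accessible from the cluster. The closest thing I found was checking the events on the service/endpoints object, which can sometimes surface sync problems (no luck in my case, as the describe output above already shows no events):

$ k get events -n production --field-selector involvedObject.name=my-app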

Update: We managed to reproduce this a couple more times. Sometimes it was less severe, for example 381 endpoints instead of 445. One interesting thing we noticed was that if we retrieved the details of the endpoints:

$ k describe endpoints my-app
Name:         my-app
Namespace:    production
Labels:       app=my-app
Annotations:  <none>
Subsets:
  Addresses:          10.36.0.5,10.36.1.5,10.36.10.5,...
  NotReadyAddresses:  10.36.199.5,10.36.209.5,10.36.239.2,...

then a bunch of IPs were "stuck" in the NotReadyAddresses state (not the ones that were "missing" from the service, though: even when I summed the number of IPs in Addresses and NotReadyAddresses, that was still less than the total number of ready pods). I don't know if this is related at all, and I couldn't find much info online about this NotReadyAddresses field.
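In case it helps, the not-ready IPs can be mapped back to their pod names with a jsonpath query like the one below (assuming the single-subset layout shown above), which makes it easier to inspect the individual pods:

$ k get endpoints my-app -o jsonpath='{range .subsets[*].notReadyAddresses[*]}{.ip}{"\t"}{.targetRef.name}{"\n"}{end}'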

asked Nov 07 '22 by Mark Vincze

2 Answers

It turned out that this was caused by using preemptible VMs in our node pools; it doesn't happen if the nodes are not preemptible.
We couldn't figure out more details of the root cause, but using preemptible VMs as nodes is not an officially supported scenario anyway, so we switched to regular VMs.
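For anyone wanting to check whether their cluster is in the same situation: on GKE, preemptible nodes should carry the cloud.google.com/gke-preemptible=true label, so something like this should list them:

$ k get nodes -l cloud.google.com/gke-preemptible=true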

answered Nov 15 '22 by Mark Vincze


Pod IPs can be added to NotReadyAddresses if a health/readiness probe is failing. This in turn prevents the pod IP from being added to the ready endpoints, which means the Kubernetes Service won't route traffic to that pod.
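A rough way to check for this (assuming the same app=my-app label as in the question and single-container pods) is to look for pods that are Running but not ready, and then inspect the readiness probe events of one of them:

# pods that are Running but report 0/1 ready containers
$ kubectl get pods -l app=my-app | grep ' 0/1 '

# probe failures show up in the pod's events
$ kubectl describe pod <pod-name> | grep -i readiness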

answered Nov 15 '22 by Chris Halcrow