I have a Kubernetes cluster running on the Google Kubernetes Engine.
I have a deployment that I manually (by editing the hpa
object) scaled up from 100 replicas to 300 replicas to do some load testing. When I was load testing the deployment by sending HTTP requests to the service, it seemed that not all pods were getting an equal amount of traffic, only around 100 pods were showing that they were processing traffic (by looking at their CPU-load, and our custom metrics). So my suspicion was that the service is not load balancing the requests among all the pods equally.
If I checked the deployment
, I could see that all 300 replicas were ready.
$ k get deploy my-app --show-labels
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE LABELS
my-app 300 300 300 300 21d app=my-app
On the other hand, when I checked the service
, I saw this:
$ k describe svc my-app
Name: my-app
Namespace: production
Labels: app=my-app
Selector: app=my-app
Type: ClusterIP
IP: 10.40.9.201
Port: http 80/TCP
TargetPort: http/TCP
Endpoints: 10.36.0.5:80,10.36.1.5:80,10.36.100.5:80 + 114 more...
Port: https 443/TCP
TargetPort: https/TCP
Endpoints: 10.36.0.5:443,10.36.1.5:443,10.36.100.5:443 + 114 more...
Session Affinity: None
Events: <none>
What was strange to me is this part
Endpoints: 10.36.0.5:80,10.36.1.5:80,10.36.100.5:80 + 114 more...
I was expecting to see 300 endpoints there, is that assumption correct?
(I also found this post, which is about a similar issue, but there the author was experiencing only a few minutes delay until the endpoints were updated, but for me it didn't change even in half an hour.)
How could I troubleshoot what was going wrong? I read that this is done by the Endpoints controller, but I couldn't find any info about where to check its logs.
Update: We managed to reproduce this a couple more times. Sometimes it was less severe, for example 381 endpoints instead of 445. One interesting thing we noticed is that if we retrieved the details of the endpoints:
$ k describe endpoints my-app
Name: my-app
Namespace: production
Labels: app=my-app
Annotations: <none>
Subsets:
Addresses: 10.36.0.5,10.36.1.5,10.36.10.5,...
NotReadyAddresses: 10.36.199.5,10.36.209.5,10.36.239.2,...
Then a bunch of IPs were "stuck" in the NotReadyAddresses
state (not the ones that were "missing" from the service though, if I summed the number of IPs in Addresses
and NotReadyAddresses
, that was still less than the total number of ready pods). Although I don't know if this is related at all, I couldn't find much info online about this NotReadyAddresses
field.
It turned out that this is caused by using preemptible VMs in our node pools, it doesn't happen if the nodes are not preemtibles.
We couldn't figure out more details of the root cause, but using preemtibles as the nodes is not an officially supported scenario anyway, so we switched to regular VMs.
Pod IPs can be added to NotReadyAddresses
if a health/readiness probe is failing. This will in turn cause the pod IP to fail to be automatically added to the endpoints, meaning that the kubernetes service can't connect to the pod.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With