Why does my Kubernetes Service only work sometimes on EKS?

Tags: kubernetes, kubernetes-service, amazon-eks

In some cases, our Services return no response when we try to access them. For example, Chrome shows ERR_EMPTY_RESPONSE, and occasionally we get other errors as well, such as a 408, which I'm fairly sure is returned by the ELB rather than by our application.

After a long, involved investigation, including SSHing into the nodes themselves and experimenting with load balancers, we are still unsure at which layer the problem actually exists: in Kubernetes itself, or in the backing services from Amazon EKS (ELB or otherwise).

- It seems that only the instance (data) port of the node has the issue. The problem seems to come and go intermittently, which makes us believe it is not something obvious in our Kubernetes manifests or Docker configuration, but rather something in the underlying infrastructure. Sometimes the service and pod will be working, but when we come back in the morning they will be broken. This leads us to believe that the issue stems from a redistribution of the pods in Kubernetes, possibly triggered by something in AWS (load balancer changes, auto-scaling group changes, etc.) or by something in Kubernetes itself when it redistributes pods for other reasons.
- In all cases we have seen, the health check port continues to work without issue, which is why Kubernetes and AWS both think that everything is OK and do not report any failures.
- We have seen some pods on a node work while others on that same node do not.
- We have verified that kube-proxy is running and that the iptables-save output is the "same" between two pods that are working ("the same" meaning that everything that is not unique, like IP addresses and ports, is the same and consistent with what it should be relative to each other). We followed these instructions: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/#is-the-kube-proxy-working
- From SSH on the node itself, for a pod that is failing, we CAN access the pod (i.e. the application itself) via all the IP/port combinations we expect:
  - the 10. address of the node itself, on the instance data port.
  - the 10. address of the pod (Docker container), on the application port.
  - the 172. address of the ??? on the application port (we are not sure what that IP is, or how the IP route gets to it, as it is on a different subnet than the 172 address of the docker0 interface).
- From SSH on another node, for a pod that is failing, we cannot access the failing pod on any port (ERR_EMPTY_RESPONSE). This seems to be the same behaviour as the service/load balancer.
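
The node-by-node checks above can be repeated consistently with a small probe script. This is only a minimal sketch, not what we actually ran: the function name `probe` and its return strings are made up for illustration, and it assumes the application speaks plain HTTP on the port being tested. It sends a bare GET and classifies the result, so an "empty response" (the server accepts the connection and closes it without sending anything) can be told apart from a timeout or a refused connection.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Send a minimal HTTP GET to host:port and classify what comes back.

    Hypothetical helper for comparing the same ip/port from different nodes.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            request = f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n"
            sock.sendall(request.encode("ascii"))
            data = sock.recv(1024)
    except socket.timeout:
        # Nothing answered in time (packets dropped somewhere on the path).
        return "timeout"
    except ConnectionRefusedError:
        # Nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Any other socket-level failure, e.g. a connection reset.
        return f"error: {exc.__class__.__name__}"
    if not data:
        # Connection was accepted, then closed without any bytes sent.
        return "empty response"
    # Return the HTTP status line, e.g. "HTTP/1.1 200 OK".
    return data.split(b"\r\n", 1)[0].decode("latin-1")
```

Running something like `probe("10.0.0.12", 31080)` (a made-up node address and instance port) once from the node hosting the failing pod and once from another node gives two results that can be compared directly.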

What else could cause behaviour like this?

asked Aug 03 '18 12:08

Ben

2 Answers

After much investigation, it turned out we were fighting a number of issues:

- Our application didn't always behave the way we expected. Always check that first.
- In our Kubernetes Service manifest, we had set `externalTrafficPolicy: Local`, which probably should work but was causing us problems. (This was with a Classic Load Balancer, `service.beta.kubernetes.io/aws-load-balancer-type: "clb"`.) So if you have problems with a CLB, either remove `externalTrafficPolicy` or explicitly set it to the default value, `Cluster`.

So our manifest is now:

    kind: Service
    apiVersion: v1
    metadata:
      name: apollo-service
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: "clb"
        service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:REDACTED"
        service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
        service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
    spec:
      externalTrafficPolicy: Cluster
      selector:
        app: apollo
      ports:
        - name: http
          protocol: TCP
          port: 80
          targetPort: 80
        - name: https
          protocol: TCP
          port: 443
          targetPort: 80
      type: LoadBalancer
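
As a rough illustration of why the policy can matter: with `externalTrafficPolicy: Local`, kube-proxy only forwards traffic arriving on a node to pods running on that same node, while with `Cluster` any node can forward to any pod. The toy model below sketches that documented difference. It is not a diagnosis of what actually went wrong in our cluster, and the function name and node names are made up for the example.

```python
from typing import Dict, List


def nodes_that_answer(pods_per_node: Dict[str, int], policy: str) -> List[str]:
    """Return the nodes whose node port can reach a backend pod.

    pods_per_node maps a (hypothetical) node name to the number of ready
    pods for the Service on that node.
    """
    if policy not in ("Cluster", "Local"):
        raise ValueError(f"unknown externalTrafficPolicy: {policy!r}")
    if policy == "Cluster":
        # Any node can forward to any pod, so every node answers
        # as long as at least one pod exists somewhere.
        has_backend = any(count > 0 for count in pods_per_node.values())
        return sorted(pods_per_node) if has_backend else []
    # Local: only nodes that host a pod themselves can answer.
    return sorted(node for node, count in pods_per_node.items() if count > 0)
```

In this model, if the pods start on `node-a` and `node-b` and are later rescheduled so that both run on `node-a`, then `Local` leaves only `node-a` answering, while `Cluster` keeps every node answering. That kind of change after rescheduling is at least consistent with a Service that works one day and not the next.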

answered Oct 18 '22 07:10

Ben

Adding these annotations fixed this for me:

    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"

answered Oct 18 '22 08:10

dank
