Why does my Kubernetes Service only work sometimes on EKS?

In some cases, we have Services that get no response when trying to access them. E.g. Chrome shows ERR_EMPTY_RESPONSE, and occasionally we get other errors as well, like 408, which I'm fairly sure is returned from the ELB rather than our application itself.

After a long, involved investigation, including ssh'ing into the nodes themselves, experimenting with load balancers, and more, we are still unsure at which layer the problem actually exists: either in Kubernetes itself, or in the backing services from Amazon EKS (the ELB or otherwise).

  • It seems that only the instance (data) port of the node has the issue. The problem comes and goes intermittently, which makes us believe it is not something obvious in our Kubernetes manifests or Docker configuration, but rather something else in the underlying infrastructure. Sometimes the service & pod will be working, but come back in the morning and it will be broken. This leads us to believe that the issue stems from a redistribution of the pods in Kubernetes, possibly triggered by something in AWS (load balancer changing, auto-scaling group changes, etc.) or by something in Kubernetes itself when it redistributes pods for other reasons.
  • In all cases we have seen, the health check port continues to work without issue, which is why Kubernetes and AWS both think that everything is OK and do not report any failures.
  • We have seen some pods on a node work, while others do not on that same node.
  • We have verified that kube-proxy is running and that the iptables-save output is the "same" between two pods that are working ("the same" meaning that everything that is not unique, like IP addresses and ports, matches and is consistent with what it should be relative to the other). We used these instructions to help: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/#is-the-kube-proxy-working
  • From ssh on the node itself, for a pod that is failing, we CAN access the pod (i.e. the application itself) via all possible IPs/ports that are expected:
    • the 10. address of the node itself, on the instance data port.
    • the 10. address of the pod (docker container) on the application port.
    • the 172. address of the ??? on the application port (we are not sure what that IP is, or how the IP route gets to it, as it is on a different subnet from the 172 address of the docker0 interface).
  • From ssh on another node, for a pod that is failing, we cannot access the failing pod on any port (ERR_EMPTY_RESPONSE). This seems to be the same behaviour as the service/load balancer. (See the sketch after this list for the checks we ran.)
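
For reference, a minimal sketch of the checks described above, run from the nodes over ssh. The pod IP 10.0.1.23, the NodePort 31000, and the other node's IP 10.0.2.45 are placeholders for illustration, not values from our cluster; the service name apollo-service and port 80 come from our manifest in the answer below:

    # On the node hosting the failing pod: confirm kube-proxy is running
    ps auxw | grep kube-proxy

    # Dump the iptables rules that kube-proxy programs, to compare with a working node
    sudo iptables-save | grep apollo-service

    # Still on that node, the pod answers on every address we expect
    curl -v http://10.0.1.23:80/      # pod (container) IP, application port
    curl -v http://localhost:31000/   # node itself, instance (data) NodePort

    # From a different node, the same requests fail (empty reply / timeout);
    # 10.0.2.45 stands in for the failing node's own 10. address
    curl -v http://10.0.1.23:80/
    curl -v http://10.0.2.45:31000/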

What else could cause behaviour like this?

asked Aug 03 '18 by Ben


2 Answers

After much investigation, we were fighting a number of issues:

  • Our application didn't always behave the way we were expecting. Always check that first.
  • In our Kubernetes Service manifest, we had set externalTrafficPolicy: Local, which probably should work, but was causing us problems. (This was while using a Classic Load Balancer, i.e. service.beta.kubernetes.io/aws-load-balancer-type: "clb".) So if you have problems with a CLB, either remove externalTrafficPolicy or explicitly set it to the default "Cluster" value.

So our manifest is now:

    kind: Service
    apiVersion: v1
    metadata:
      name: apollo-service
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: "clb"
        service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:REDACTED"
        service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
        service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
    spec:
      externalTrafficPolicy: Cluster
      selector:
        app: apollo
      ports:
        - name: http
          protocol: TCP
          port: 80
          targetPort: 80
        - name: https
          protocol: TCP
          port: 443
          targetPort: 80
      type: LoadBalancer
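
To roll out and sanity-check a change like this, something along these lines should work (a sketch; the file name is an assumption, not from the original post):

    # Apply the updated Service manifest (file name is assumed)
    kubectl apply -f apollo-service.yaml

    # Confirm the Service picked up the Cluster traffic policy
    kubectl get service apollo-service -o jsonpath='{.spec.externalTrafficPolicy}'

    # Check that endpoints exist for the app=apollo selector
    kubectl get endpoints apollo-service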

answered Oct 18 '22 by Ben

Adding

service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"

fixed this for me.
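
If you would rather patch an existing Service than edit the manifest, a sketch like the following should do the same thing (the Service name apollo-service is taken from the accepted answer and may differ in your cluster):

    # Add the two annotations to the live Service; --overwrite replaces any existing values
    kubectl annotate service apollo-service \
      service.beta.kubernetes.io/aws-load-balancer-ssl-ports="443" \
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol="http" \
      --overwrite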

answered Oct 18 '22 by dank