kubectl commands timeout without details

I'm running a Kubernetes cluster that has worked fine for several months. Today, when I was about to deploy some updates, I started getting timeouts from the server.

Running $ kubectl get nodes yields

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)

Running $ kubectl get pods --all-namespaces yields

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)

Running $ kubectl get deployments yields

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.extensions)

Running $ kubectl get svc yields

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get services)

Running $ kubectl cluster-info yields (note that nothing is listed after the master):

Kubernetes master is running at https://cluster.mysite.com

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

As I get these timeouts for every command, troubleshooting is impossible.

How can I continue from here to access my servers? I'm using kube-aws with an AWS CloudFormation VPC.

Thanks for your time.

EDIT:

As requested, I ran $ kubectl get pods -v 7 and, after a series of lines about cached responses, got this:

I0103 16:51:32.196859 25644 round_trippers.go:414] GET cluster.mysite.com/api/v1/nodes
I0103 16:51:32.196888 25644 round_trippers.go:421] Request Headers: 
I0103 16:51:32.196894 25644 round_trippers.go:424]     Accept: application/json
I0103 16:51:32.196899 25644 round_trippers.go:424]     User-Agent: kubectl/v1.8.3 (darwin/amd64) kubernetes/f0efb3c
I0103 16:52:32.239841 25644 round_trippers.go:439]     Response Status: 504 Gateway Timeout in 60044 milliseconds

I also ran $ kubectl cluster-info dump -v 7 and got:

I0103 16:51:32.196888   25644 round_trippers.go:421] Request Headers:
I0103 16:51:32.196894   25644 round_trippers.go:424]     Accept: application/json
I0103 16:51:32.196899   25644 round_trippers.go:424]     User-Agent: kubectl/v1.8.3 (darwin/amd64) kubernetes/f0efb3c
I0103 16:52:32.239841   25644 round_trippers.go:439] Response Status: 504 Gateway Timeout in 60044 milliseconds
I0103 16:52:32.242362   25644 helpers.go:207] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)",
  "reason": "Timeout",
  "details": {
    "kind": "nodes",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"The list operation against nodes could not be completed at this time, please try again.\",\"reason\":\"ServerTimeout\",\"details\":{\"name\":\"list\",\"kind\":\"nodes\"},\"code\":500}"
      }
    ]
  },
  "code": 504
}]

EDIT 2: Okay, now I'm just getting Unable to connect to the server: EOF on every request, and I'm starting to get scared. This is a production cluster and I can't even access it to troubleshoot. Does anyone have a hint on how to proceed?
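To tell an unresponsive apiserver apart from a kubectl or kubeconfig problem, the endpoint can be probed directly, bypassing kubectl (hostname as above); an immediate EOF or reset here points at the apiserver or its load balancer rather than the client:

curl -vk https://cluster.mysite.com/healthz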

EDIT 3: I've gotten as far as realizing that the etcd cluster was not working properly, with two of the three nodes out of sync. Restarting one node got it to rejoin the cluster properly, but the second one can't start its services. The services that don't start are:

  • etcdadm-check.service
  • etcdadm-save.service
  • etcdadm-update-status.service
  • [email protected]

The first three all give the error etcdadm-check.service: Control process exited, code=exited status=3 and the last one gives [email protected]: Start request repeated too quickly..
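More detail on these failures can be pulled from the systemd journal for each unit (assuming the standard journal on the node), e.g.:

journalctl -u etcdadm-check.service --no-pager -n 100
journalctl -u etcdadm-save.service --no-pager -n 100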

Any hints on how to handle this?

Also, after restoring the second etcd node, I get Unable to connect to the server: x509: certificate signed by unknown authority when running any kubectl command. Does this signify data loss? My certificates are still valid for over half a year, and I haven't changed anything about them.
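One way to see which certificate the apiserver is actually presenting, and whether its issuer matches the expected CA, is to inspect the endpoint directly (cluster hostname as above):

openssl s_client -connect cluster.mysite.com:443 </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates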

EDIT 4: I still have the etcd issue, but I'm following the instructions in camil's answer and will update with the result. The certificate issue, however, I solved simply by re-running $ kube-aws render credentials with the proper paths to my intermediate root CA.

asked by Helge Talvik Söderström

2 Answers

To avoid the timeouts, you can pass the flag --request-timeout='1s'. This will allow further debugging.
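For example, appended to the commands that were timing out:

kubectl get nodes --request-timeout='1s'
kubectl get pods --all-namespaces --request-timeout='1s'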

I see you are running kube-aws, so it will be safe to terminate the master instances (at least one at a time, if you run multiple masters). The ASG will replace them automatically. You can also do this with the etcd nodes.
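For example, to recycle a master through its Auto Scaling group without lowering the desired capacity (the instance ID here is a placeholder for the actual master instance):

aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --no-should-decrement-desired-capacity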

If the issue still persists, you have to SSH into the masters and check the logs and services by running commands like:

journalctl -xe                         # recent journal entries with explanatory context
systemctl status -l kubelet.service    # kubelet state plus its latest log lines
systemctl status -l flanneld.service   # flannel overlay network
systemctl status -l docker.service     # container runtime
rkt list                               # pods run via rkt (kube-aws runs the kubelet under rkt)
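On a kube-aws master the control-plane components themselves run as containers, so their logs can also be tailed directly (the container ID is a placeholder copied from the docker ps output):

docker ps | grep apiserver              # find the kube-apiserver container
docker logs --tail 100 <container-id>   # inspect its most recent log lines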

You can also define this shell function to debug with kubectl from inside the masters:

kubectl() {
  # run kubectl from the hyperkube image; --net=host lets it reach the
  # apiserver on the host's own network
  /usr/bin/docker run --rm --net=host \
    -v /etc/resolv.conf:/etc/resolv.conf \
    -v /srv/kube-aws/plugins:/srv/kube-aws/plugins \
    quay.io/coreos/hyperkube:v1.9.0_coreos.0 /hyperkube kubectl "$@"
}

Then try these commands:

kubectl get componentstatus
kubectl cluster-info
kubectl get pods -n kube-system
kubectl get events -n kube-system

Check the connectivity to etcd from the masters:

export $(cat /etc/etcd-environment | tr -d "'")   # loads ETCD_ENDPOINTS into the shell

/usr/bin/etcdctl \
  --ca-file=/etc/kubernetes/ssl/etcd-trusted-ca.pem \
  --cert-file=/etc/kubernetes/ssl/etcd-client.pem \
  --key-file=/etc/kubernetes/ssl/etcd-client-key.pem \
  --endpoints="${ETCD_ENDPOINTS}" \
  cluster-health
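If cluster-health reports an unhealthy member (as in your EDIT 3), one common recovery path is to remove that member and re-add it after wiping its data directory. A sketch using the same flags as above; the member ID is a placeholder copied from the member list output:

flags="--ca-file=/etc/kubernetes/ssl/etcd-trusted-ca.pem --cert-file=/etc/kubernetes/ssl/etcd-client.pem --key-file=/etc/kubernetes/ssl/etcd-client-key.pem --endpoints=${ETCD_ENDPOINTS}"

/usr/bin/etcdctl $flags member list                 # note the ID of the unhealthy member
/usr/bin/etcdctl $flags member remove <member-id>   # then wipe its data dir and re-add it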
answered by Camil


rm -r ~/.kube/cache/discovery worked for me.

My timeout messages looked different from yours, though:

E0528 20:32:29.191243    1730 request.go:975] Unexpected error when reading response body: net/http: request canceled (Client.Timeout exceeded while reading body)
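If you want to confirm the cache was the culprit, clearing it is safe (kubectl rebuilds it on the next call), and re-running a command with the verbose flag from the question should show fresh discovery requests:

rm -r ~/.kube/cache/discovery   # stale API discovery cache; rebuilt automatically
kubectl get nodes -v 7          # the round-trip log should show new discovery calls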
answered by Victor Basso