I'm running a Kubernetes cluster which has worked fine for several months. Today, when I was about to deploy some updates, I started getting timeouts from the server.
Running $ kubectl get nodes
yields
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
Running $ kubectl get pods --all-namespaces
yields
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
Running $ kubectl get deployments
yields
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.extensions)
Running $ kubectl get svc
yields
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get services)
Running $ kubectl cluster-info
yields (note no output after the master)
Kubernetes master is running at https://cluster.mysite.com
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
As I get these timeouts for every command, troubleshooting is impossible.
How can I continue from here to access my servers? I'm using kube-aws and an AWS CloudFormation VPC. Thanks for your time.
EDIT:
As per request, I ran $ kubectl get pods -v 7 and, after a bunch of cached responses, got this:
I0103 16:51:32.196859 25644 round_trippers.go:414] GET cluster.mysite.com/api/v1/nodes
I0103 16:51:32.196888 25644 round_trippers.go:421] Request Headers:
I0103 16:51:32.196894 25644 round_trippers.go:424] Accept: application/json
I0103 16:51:32.196899 25644 round_trippers.go:424] User-Agent: kubectl/v1.8.3 (darwin/amd64) kubernetes/f0efb3c
I0103 16:52:32.239841 25644 round_trippers.go:439] Response Status: 504 Gateway Timeout in 60044 milliseconds
I also ran $ kubectl cluster-info dump -v 7
and got:
I0103 16:51:32.196888 25644 round_trippers.go:421] Request Headers:
I0103 16:51:32.196894 25644 round_trippers.go:424] Accept: application/json
I0103 16:51:32.196899 25644 round_trippers.go:424] User-Agent: kubectl/v1.8.3 (darwin/amd64) kubernetes/f0efb3c
I0103 16:52:32.239841 25644 round_trippers.go:439] Response Status: 504 Gateway Timeout in 60044 milliseconds
I0103 16:52:32.242362 25644 helpers.go:207] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)",
  "reason": "Timeout",
  "details": {
    "kind": "nodes",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"The list operation against nodes could not be completed at this time, please try again.\",\"reason\":\"ServerTimeout\",\"details\":{\"name\":\"list\",\"kind\":\"nodes\"},\"code\":500}"
      }
    ]
  },
  "code": 504
}]
EDIT 2:
Okay, now I'm just getting Unable to connect to the server: EOF on every request, and I'm starting to get scared. This is a production cluster and I can't even access it to troubleshoot. Does anyone have a hint on how to proceed?
EDIT 3: I've gotten as far as realizing that the etcd cluster was not working properly, with 2 of 3 nodes out of sync. Restarting one node had it properly rejoining the cluster, but on the second one the services won't start. Of the services that fail to start, the first three all give the error etcdadm-check.service: Control process exited, code=exited status=3, and the last one gives user@0.service: Start request repeated too quickly. Any hints on how to handle this?
Also, after restoring the second etcd node, every kubectl command now fails with Unable to connect to the server: x509: certificate signed by unknown authority. Does this signify data loss? My certificates are still valid for over half a year, and I haven't changed anything about them.
EDIT 4:
I still have the etcd issue, but I am following the instructions in camil's answer for now and will update with the result. However, I solved the invalid-certificate issue simply by re-running $ kube-aws render credentials with the proper paths to my intermediate root CA.
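For reference, the re-render invocation looked roughly like this (the file paths are placeholders for my own intermediate CA files):

# Re-render the cluster credentials against the existing intermediate CA
kube-aws render credentials \
  --ca-cert-path=./credentials/intermediate-ca.pem \
  --ca-key-path=./credentials/intermediate-ca-key.pem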
To avoid the timeouts, you can pass the flag --request-timeout='1s'. This will allow further debugging.
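For example, to fail fast and see the request details:

kubectl get nodes --request-timeout='1s' -v=7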
I see you are running kube-aws, so it will be safe to terminate the master instances (at least one, if you run multiple masters). The ASG will replace them automatically. You can also do this with the etcd nodes.
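If you prefer doing this from the CLI rather than the AWS console, something along these lines should work (the instance ID is a placeholder; keeping the desired capacity unchanged makes the ASG launch a replacement):

# Terminate one master instance and let the ASG replace it
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --no-should-decrement-desired-capacity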
If the issue still persists, you will have to SSH into the masters and check the logs and services by running commands like:
journalctl -xe
systemctl status -l kubelet.service
systemctl status -l flanneld.service
systemctl status -l docker.service
rkt list
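If the etcd cluster is suspect, it may also help to look at the etcd unit directly on the etcd nodes (on CoreOS Container Linux, which kube-aws deploys, the unit is typically etcd-member.service):

journalctl -u etcd-member.service --no-pager | tail -n 50
systemctl status -l etcd-member.service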
You can also use this function to debug with kubectl from inside the masters:
kubectl() {
  # Run kubectl from the hyperkube image, sharing the host's network and DNS config
  /usr/bin/docker run --rm --net=host \
    -v /etc/resolv.conf:/etc/resolv.conf \
    -v /srv/kube-aws/plugins:/srv/kube-aws/plugins \
    quay.io/coreos/hyperkube:v1.9.0_coreos.0 /hyperkube kubectl "$@"
}
Then try these commands:
kubectl get componentstatus
kubectl cluster-info
kubectl get pods -n kube-system
kubectl get events -n kube-system
Check the connectivity to etcd from the masters:
export $(cat /etc/etcd-environment | tr -d "'")
/usr/bin/etcdctl \
  --ca-file=/etc/kubernetes/ssl/etcd-trusted-ca.pem \
  --cert-file=/etc/kubernetes/ssl/etcd-client.pem \
  --key-file=/etc/kubernetes/ssl/etcd-client-key.pem \
  --endpoints="${ETCD_ENDPOINTS}" \
  cluster-health
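If cluster-health reports an unhealthy member, one possible recovery path (sketched here with etcd v2 syntax; the member ID, name, and peer URL below are placeholders) is to remove the broken member and re-add it so it can rejoin with a clean data directory:

# Reuse the TLS flags from the check above
ETCDCTL_FLAGS="--ca-file=/etc/kubernetes/ssl/etcd-trusted-ca.pem --cert-file=/etc/kubernetes/ssl/etcd-client.pem --key-file=/etc/kubernetes/ssl/etcd-client-key.pem --endpoints=${ETCD_ENDPOINTS}"

# Find the ID of the unhealthy member
/usr/bin/etcdctl $ETCDCTL_FLAGS member list

# Remove it, then re-add it (clear the member's data dir on that node before it rejoins)
/usr/bin/etcdctl $ETCDCTL_FLAGS member remove 8211f1d0f64f3269
/usr/bin/etcdctl $ETCDCTL_FLAGS member add etcd2 https://10.0.0.2:2380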
Running rm -r ~/.kube/cache/discovery worked for me. My timeout messages looked different from yours, though:
E0528 20:32:29.191243 1730 request.go:975] Unexpected error when reading response body: net/http: request canceled (Client.Timeout exceeded while reading body)