I have a cluster consisting of 6 servers, 3 masters and 3 workers. Up to this morning everything worked fine, until I removed two workers from the cluster.
Now internal DNS is not working anymore: I cannot resolve any internal name. External names such as google.com, however, do resolve and I can ping them.
My cluster is running Kubernetes v1.18.2 (Calico for networking), installed with Kubespray. I can access my services from outside, but when they try to connect to each other, they fail (for example when the UI tries to connect to the database).
I provide below some of the output from the command listed here: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
kubectl exec -ti busybox-6899b748d7-pbdk4 -- cat /etc/resolv.conf
nameserver 10.233.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ovh.net
options ndots:5
kubectl exec -ti busybox-6899b748d7-pbdk4 -- nslookup kubernetes.default
Server: 10.233.0.10
Address: 10.233.0.10:53
** server can't find kubernetes.default: NXDOMAIN
*** Can't find kubernetes.default: No answer
command terminated with exit code 1
kubectl exec -ti busybox-6899b748d7-pbdk4 -- nslookup google.com
Server: 10.233.0.10
Address: 10.233.0.10:53
Non-authoritative answer:
Name: google.com
Address: 172.217.22.142
*** Can't find google.com: No answer
kubectl exec -ti busybox-6899b748d7-pbdk4 -- ping google.com
PING google.com (172.217.22.142): 56 data bytes
64 bytes from 172.217.22.142: seq=0 ttl=52 time=4.409 ms
64 bytes from 172.217.22.142: seq=1 ttl=52 time=4.359 ms
^C
--- google.com ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 4.359/4.384/4.409 ms
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS   AGE
coredns-74b594f4c6-5k6kq   1/1     Running   2          6d7h
coredns-74b594f4c6-9ct8x   1/1     Running   0          16m
When I get the logs for the DNS pods:

for p in $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name); do kubectl logs --namespace=kube-system $p; done

they are full of errors like:
E0522 11:56:22.613704 1 reflector.go:153] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:105: Failed to list *v1.Endpoints: Get https://10.233.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E0522 11:56:33.678487 1 reflector.go:307] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:105: Failed to watch *v1.Service: Get https://10.233.0.1:443/api/v1/services?allowWatchBookmarks=true&resourceVersion=1667490&timeout=8m12s&timeoutSeconds=492&watch=true: dial tcp 10.233.0.1:443: connect: connection refused
E0522 12:19:42.356157 1 reflector.go:307] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:105: Failed to watch *v1.Namespace: Get https://10.233.0.1:443/api/v1/namespaces?allowWatchBookmarks=true&resourceVersion=1667490&timeout=6m39s&timeoutSeconds=399&watch=true: dial tcp 10.233.0.1:443: connect: connection refused
E0522 12:19:42.356327 1 reflector.go:307] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:105: Failed to watch *v1.Service: Get https://10.233.0.1:443/api/v1/services?allowWatchBookmarks=true&resourceVersion=1667490&timeout=6m41s&timeoutSeconds=401&watch=true: dial tcp 10.233.0.1:443: connect: connection refused
The coredns service is up:

kubectl get svc --namespace=kube-system
NAME                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
coredns                     ClusterIP   10.233.0.3      <none>        53/UDP,53/TCP,9153/TCP   7d4h
dashboard-metrics-scraper   ClusterIP   10.233.52.242   <none>        8000/TCP                 7d4h
kubernetes-dashboard        ClusterIP   10.233.63.42    <none>        443/TCP                  7d4h
voyager-operator            ClusterIP   10.233.31.206   <none>        443/TCP,56791/TCP        6d5h
The endpoints are exposed:

kubectl get ep coredns --namespace=kube-system
NAME      ENDPOINTS                                         AGE
coredns   10.233.68.9:53,10.233.79.7:53,10.233.68.9:9153 + 3 more...   7d4h
What did I break? How can I fix this?
EDIT: More information requested in the comments:

kubectl get pods -n kube-system
NAME                                          READY   STATUS    RESTARTS   AGE
calico-kube-controllers-5d9cfb4bfd-8h7jd      1/1     Running   0          3d14h
calico-node-6w8g6                             1/1     Running   13         4d15h
calico-node-78thq                             1/1     Running   6          7d19h
calico-node-cr4jl                             1/1     Running   23         4d16h
calico-node-g5q99                             1/1     Running   1          3d15h
calico-node-pmss2                             1/1     Running   0          3d15h
calico-node-zw9fk                             1/1     Running   18         4d19h
coredns-74b594f4c6-5k6kq                      1/1     Running   2          6d22h
coredns-74b594f4c6-9ct8x                      1/1     Running   0          15h
dns-autoscaler-7594b8c675-j5jfv               1/1     Running   0          15h
kube-apiserver-kub1                           1/1     Running   42         7d20h
kube-apiserver-kub2                           1/1     Running   1          7d19h
kube-apiserver-kub3                           1/1     Running   33         7d19h
kube-controller-manager-kub1                  1/1     Running   37         7d20h
kube-controller-manager-kub2                  1/1     Running   4          3d15h
kube-controller-manager-kub3                  1/1     Running   55         7d19h
kube-proxy-4dlf8                              1/1     Running   4          4d15h
kube-proxy-4nlhf                              1/1     Running   2          4d15h
kube-proxy-82kkz                              1/1     Running   3          4d15h
kube-proxy-lvsfz                              1/1     Running   0          3d15h
kube-proxy-pmhnx                              1/1     Running   4          4d15h
kube-proxy-wpfnn                              1/1     Running   10         4d15h
kube-scheduler-kub1                           1/1     Running   34         7d20h
kube-scheduler-kub2                           1/1     Running   3          7d19h
kube-scheduler-kub3                           1/1     Running   51         7d19h
kubernetes-dashboard-7dbcd59666-79gxv         1/1     Running   0          3d14h
kubernetes-metrics-scraper-6858b8c44d-g9m9w   1/1     Running   1          5d22h
nginx-proxy-galaxy                            1/1     Running   2          4d15h
nginx-proxy-kub4                              1/1     Running   7          4d19h
nginx-proxy-kub5                              1/1     Running   6          4d16h
nodelocaldns-2dv59                            1/1     Running   0          3d15h
nodelocaldns-9skxm                            1/1     Running   5          4d16h
nodelocaldns-dwg4z                            1/1     Running   4          4d15h
nodelocaldns-nmwwz                            1/1     Running   12         7d19h
nodelocaldns-qkq8n                            1/1     Running   4          4d19h
nodelocaldns-v84jj                            1/1     Running   8          7d19h
voyager-operator-5677998d47-psskf             1/1     Running   10         4d15h
Kubernetes DNS schedules a DNS Pod and Service on the cluster, and configures the kubelets to tell individual containers to use the DNS Service's IP to resolve DNS names. Every Service defined in the cluster (including the DNS server itself) is assigned a DNS name.
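That naming scheme can be checked directly with a couple of lookups (a sketch; run these from any pod with a working resolver, and adjust cluster.local if your cluster uses a different domain):

```shell
# Every Service gets a record of the form <service>.<namespace>.svc.<cluster-domain>.
nslookup kubernetes.default.svc.cluster.local   # fully qualified name

# Short forms work because they are expanded via the search list
# in the pod's /etc/resolv.conf (default.svc.cluster.local, svc.cluster.local, ...).
nslookup kubernetes.default
```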
I was able to reproduce the scenario.
$ kubectl exec -it busybox -n dev -- nslookup kubernetes.default
Server: 10.96.0.10
Address: 10.96.0.10:53
** server can't find kubernetes.default: NXDOMAIN
*** Can't find kubernetes.default: No answer
command terminated with exit code 1
$ kubectl exec -it busybox -n dev -- nslookup google.com
Server: 10.96.0.10
Address: 10.96.0.10:53
Non-authoritative answer:
Name: google.com
Address: 172.217.168.238
*** Can't find google.com: No answer
$ kubectl exec -it busybox -n dev -- ping google.com
PING google.com (172.217.168.238): 56 data bytes
64 bytes from 172.217.168.238: seq=0 ttl=52 time=18.425 ms
64 bytes from 172.217.168.238: seq=1 ttl=52 time=27.176 ms
64 bytes from 172.217.168.238: seq=2 ttl=52 time=18.603 ms
64 bytes from 172.217.168.238: seq=3 ttl=52 time=15.445 ms
64 bytes from 172.217.168.238: seq=4 ttl=52 time=16.492 ms
64 bytes from 172.217.168.238: seq=5 ttl=52 time=19.294 ms
^C
--- google.com ping statistics ---
6 packets transmitted, 6 packets received, 0% packet loss
round-trip min/avg/max = 15.445/19.239/27.176 ms
But when I followed the same steps using the dnsutils image mentioned in the Kubernetes docs, I got a positive response.
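For reference, the dnsutils pod from that doc can be created with a manifest along these lines (a sketch adapted from the DNS debugging guide; the namespace is just the one used in this test, and the image tag may differ for your cluster version):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: dev
spec:
  containers:
  - name: dnsutils
    image: registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3
    command: ["sleep", "infinity"]   # keep the pod alive for interactive exec
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
```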
$ kubectl exec -ti dnsutils -n dev -- nslookup kubernetes.default
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: kubernetes.default.svc.cluster.local
Address: 10.96.0.1
$ kubectl exec -ti dnsutils -n dev -- nslookup google.com
Server: 10.96.0.10
Address: 10.96.0.10#53
Non-authoritative answer:
Name: google.com
Address: 172.217.168.238
Name: google.com
Address: 2a00:1450:400e:80c::200e
As per my understanding, the problem lies with the nslookup applet shipped in the busybox image rather than with cluster DNS itself: recent busybox releases are known to have an unreliable nslookup that reports NXDOMAIN / "No answer" even for names that resolve correctly. That's why we're getting this DNS resolution error.
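If you want to keep using busybox for these checks, pinning an older image tag is a common workaround (a sketch; busybox 1.28 is widely reported to be the last tag whose nslookup behaves correctly against cluster DNS):

```shell
# Run a throwaway busybox 1.28 pod and query cluster DNS from it
# (the pod name is arbitrary; add -n <namespace> if needed).
kubectl run busybox128 --rm -it --restart=Never --image=busybox:1.28 \
  -- nslookup kubernetes.default
```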