I have a cluster consisting of 6 servers, 3 masters and 3 workers. Up to this morning everything worked fine, until I removed two workers from the cluster.
Now the internal DNS is not working anymore. I cannot resolve an internal name. Apparently google.com is resolved and I can ping it.
My cluster is running Kubernetes V1.18.2 (calico for networking), installed with kubespray. I can access my services from outside, but when it's time for them to connect to each other, they fail (for example when the UI tries to connect to the database).
I provide below some of the output from the command listed here: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
kubectl exec -ti busybox-6899b748d7-pbdk4 -- cat /etc/resolv.conf
nameserver 10.233.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ovh.net
options ndots:5
kubectl exec -ti busybox-6899b748d7-pbdk4 -- nslookup kubernetes.default
Server:         10.233.0.10
Address:        10.233.0.10:53
** server can't find kubernetes.default: NXDOMAIN
*** Can't find kubernetes.default: No answer
command terminated with exit code 1
kubectl exec -ti busybox-6899b748d7-pbdk4 -- nslookup google.com
Server:         10.233.0.10
Address:        10.233.0.10:53
Non-authoritative answer:
Name:   google.com
Address: 172.217.22.142
*** Can't find google.com: No answer
kubectl exec -ti busybox-6899b748d7-pbdk4 -- ping google.com
PING google.com (172.217.22.142): 56 data bytes
64 bytes from 172.217.22.142: seq=0 ttl=52 time=4.409 ms
64 bytes from 172.217.22.142: seq=1 ttl=52 time=4.359 ms
^C
--- google.com ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 4.359/4.384/4.409 ms
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS   AGE
coredns-74b594f4c6-5k6kq   1/1     Running   2          6d7h
coredns-74b594f4c6-9ct8x   1/1     Running   0          16m
when I get the logs for the DNS pods: for p in $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name); do kubectl logs --namespace=kube-system $p; done they are full of:
E0522 11:56:22.613704       1 reflector.go:153] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Failed to list *v1.Endpoints: Get https://10.233.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E0522 11:56:33.678487       1 reflector.go:307] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Failed to watch *v1.Service: Get https://10.233.0.1:443/api/v1/services?allowWatchBookmarks=true&resourceVersion=1667490&timeout=8m12s&timeoutSeconds=492&watch=true: dial tcp 10.233.0.1:443: connect: connection refused
E0522 12:19:42.356157       1 reflector.go:307] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Failed to watch *v1.Namespace: Get https://10.233.0.1:443/api/v1/namespaces?allowWatchBookmarks=true&resourceVersion=1667490&timeout=6m39s&timeoutSeconds=399&watch=true: dial tcp 10.233.0.1:443: connect: connection refused
E0522 12:19:42.356327       1 reflector.go:307] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Failed to watch *v1.Service: Get https://10.233.0.1:443/api/v1/services?allowWatchBookmarks=true&resourceVersion=1667490&timeout=6m41s&timeoutSeconds=401&watch=true: dial tcp 10.233.0.1:443: connect: connection refused
The coredns service is up: kubectl get svc --namespace=kube-system
NAME                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
coredns                     ClusterIP   10.233.0.3      <none>        53/UDP,53/TCP,9153/TCP   7d4h
dashboard-metrics-scraper   ClusterIP   10.233.52.242   <none>        8000/TCP                 7d4h
kubernetes-dashboard        ClusterIP   10.233.63.42    <none>        443/TCP                  7d4h
voyager-operator            ClusterIP   10.233.31.206   <none>        443/TCP,56791/TCP        6d5h
The endpoints are exposed: kubectl get ep coredns --namespace=kube-system
NAME      ENDPOINTS                                                    AGE
coredns   10.233.68.9:53,10.233.79.7:53,10.233.68.9:9153 + 3 more...   7d4h
What did I break? How can I fix this?
EDIT: More information requested in the comments: kubectl get pods -n kube-system
NAME                                          READY   STATUS    RESTARTS   AGE
calico-kube-controllers-5d9cfb4bfd-8h7jd      1/1     Running   0          3d14h
calico-node-6w8g6                             1/1     Running   13         4d15h
calico-node-78thq                             1/1     Running   6          7d19h
calico-node-cr4jl                             1/1     Running   23         4d16h
calico-node-g5q99                             1/1     Running   1          3d15h
calico-node-pmss2                             1/1     Running   0          3d15h
calico-node-zw9fk                             1/1     Running   18         4d19h
coredns-74b594f4c6-5k6kq                      1/1     Running   2          6d22h
coredns-74b594f4c6-9ct8x                      1/1     Running   0          15h
dns-autoscaler-7594b8c675-j5jfv               1/1     Running   0          15h
kube-apiserver-kub1                           1/1     Running   42         7d20h
kube-apiserver-kub2                           1/1     Running   1          7d19h
kube-apiserver-kub3                           1/1     Running   33         7d19h
kube-controller-manager-kub1                  1/1     Running   37         7d20h
kube-controller-manager-kub2                  1/1     Running   4          3d15h
kube-controller-manager-kub3                  1/1     Running   55         7d19h
kube-proxy-4dlf8                              1/1     Running   4          4d15h
kube-proxy-4nlhf                              1/1     Running   2          4d15h
kube-proxy-82kkz                              1/1     Running   3          4d15h
kube-proxy-lvsfz                              1/1     Running   0          3d15h
kube-proxy-pmhnx                              1/1     Running   4          4d15h
kube-proxy-wpfnn                              1/1     Running   10         4d15h
kube-scheduler-kub1                           1/1     Running   34         7d20h
kube-scheduler-kub2                           1/1     Running   3          7d19h
kube-scheduler-kub3                           1/1     Running   51         7d19h
kubernetes-dashboard-7dbcd59666-79gxv         1/1     Running   0          3d14h
kubernetes-metrics-scraper-6858b8c44d-g9m9w   1/1     Running   1          5d22h
nginx-proxy-galaxy                            1/1     Running   2          4d15h
nginx-proxy-kub4                              1/1     Running   7          4d19h
nginx-proxy-kub5                              1/1     Running   6          4d16h
nodelocaldns-2dv59                            1/1     Running   0          3d15h
nodelocaldns-9skxm                            1/1     Running   5          4d16h
nodelocaldns-dwg4z                            1/1     Running   4          4d15h
nodelocaldns-nmwwz                            1/1     Running   12         7d19h
nodelocaldns-qkq8n                            1/1     Running   4          4d19h
nodelocaldns-v84jj                            1/1     Running   8          7d19h
voyager-operator-5677998d47-psskf             1/1     Running   10         4d15h
Kubernetes DNS schedules a DNS Pod and Service on the cluster, and configures the kubelets to tell individual containers to use the DNS Service's IP to resolve DNS names. Every Service defined in the cluster (including the DNS server itself) is assigned a DNS name.
I was able to reproduce the scenario.
$ kubectl exec -it busybox -n dev -- nslookup kubernetes.default    
Server:         10.96.0.10
Address:        10.96.0.10:53
** server can't find kubernetes.default: NXDOMAIN
*** Can't find kubernetes.default: No answer
command terminated with exit code 1
$ kubectl exec -it busybox -n dev -- nslookup google.com        
Server:         10.96.0.10
Address:        10.96.0.10:53
Non-authoritative answer:
Name:   google.com
Address: 172.217.168.238
*** Can't find google.com: No answer
$ kubectl exec -it busybox -n dev -- ping google.com    
PING google.com (172.217.168.238): 56 data bytes
64 bytes from 172.217.168.238: seq=0 ttl=52 time=18.425 ms
64 bytes from 172.217.168.238: seq=1 ttl=52 time=27.176 ms
64 bytes from 172.217.168.238: seq=2 ttl=52 time=18.603 ms
64 bytes from 172.217.168.238: seq=3 ttl=52 time=15.445 ms
64 bytes from 172.217.168.238: seq=4 ttl=52 time=16.492 ms
64 bytes from 172.217.168.238: seq=5 ttl=52 time=19.294 ms
^C
--- google.com ping statistics ---
6 packets transmitted, 6 packets received, 0% packet loss
round-trip min/avg/max = 15.445/19.239/27.176 ms
But I followed the same steps using dnsutils image. Which has mentioned in Kubernetes doc. It gives a positive response.
$ kubectl exec -ti dnsutils -n dev -- nslookup kubernetes.default   
Server:         10.96.0.10
Address:        10.96.0.10#53
Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1
$ kubectl exec -ti dnsutils -n dev -- nslookup google.com        
Server:         10.96.0.10
Address:        10.96.0.10#53
Non-authoritative answer:
Name:   google.com
Address: 172.217.168.238
Name:   google.com
Address: 2a00:1450:400e:80c::200e
As per my understanding, something wrong with the dnsutils in busybox container here. That's why we're getting this DNS resolve error.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With