Kubernetes DNS Troubleshooting

I am trying to troubleshoot a DNS issue in our Kubernetes v1.19 cluster. It has 3 nodes (1 controller, 2 workers), all running vanilla Ubuntu 20.04, with Calico for pod networking and MetalLB for inbound load balancing. Everything is hosted on premises and has full access to the internet. A reverse proxy (Traefik) sits in front of the cluster and handles TLS termination for it and for other services in the infrastructure.

The issue appeared when I upgraded the Helm chart for the pod that connects to the Redis pod; until then, everything had run happily for the past 36 days.

The log of one of the pods shows an error indicating that it cannot resolve the hostname of the Redis pod(s):

2020-11-09 00:00:00 [1] [verbose]:      [Cache] Attempting connection to redis.
2020-11-09 00:00:00 [1] [verbose]:      [Cache] Successfully connected to redis.
2020-11-09 00:00:00 [1] [verbose]:      [PubSub] Attempting connection to redis.
2020-11-09 00:00:00 [1] [verbose]:      [PubSub] Successfully connected to redis.
2020-11-09 00:00:00 [1] [warn]:         Secret key is weak. Please consider lengthening it for better security.
2020-11-09 00:00:00 [1] [verbose]:      [Database] Connecting to database...
2020-11-09 00:00:00 [1] [info]:         [Database] Successfully connected .
2020-11-09 00:00:00 [1] [verbose]:      [Database] Ran 0 migration(s).
2020-11-09 00:00:00 [1] [verbose]:      Sending request for public key.
Error: getaddrinfo EAI_AGAIN oct-2020-redis-master
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26) {
  errno: -3001,
  code: 'EAI_AGAIN',
  syscall: 'getaddrinfo',
  hostname: 'oct-2020-redis-master'
}
[ioredis] Unhandled error event: Error: getaddrinfo EAI_AGAIN oct-2020-redis-master
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26)
Error: connect ETIMEDOUT
    at Socket.<anonymous> (/app/node_modules/ioredis/built/redis/index.js:307:37)
    at Object.onceWrapper (events.js:421:28)
    at Socket.emit (events.js:315:20)
    at Socket.EventEmitter.emit (domain.js:486:12)
    at Socket._onTimeout (net.js:483:8)
    at listOnTimeout (internal/timers.js:554:17)
    at processTimers (internal/timers.js:497:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}
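For context: EAI_AGAIN from getaddrinfo is a temporary failure in name resolution (the configured resolver could not be reached in time), not NXDOMAIN, and the bare name oct-2020-redis-master relies on the search path in the pod's /etc/resolv.conf to expand to the FQDN that shows up in the CoreDNS logs further down. Worth confirming from inside the pod (the expected line in the comment assumes the usual default-namespace search path):

grep search /etc/resolv.conf
# expect something like: search default.svc.cluster.local svc.cluster.local cluster.local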

I have gone through the steps outlined in https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
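For anyone reproducing this, the dnsutils test pod used below comes from that guide and is created with the manifest it links:

kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml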

ubuntu@k8-01:~$ kubectl exec -i -t dnsutils -- nslookup kubernetes.default
;; connection timed out; no servers could be reached

command terminated with exit code 1
ubuntu@k8-01:~$ kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-lfm5t   1/1     Running   17         37d
coredns-f9fd979d6-sw2qp   1/1     Running   18         37d
ubuntu@k8-01:~$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
[INFO] 10.244.210.238:34288 - 28733 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.001300712s
[INFO] 10.244.210.238:44532 - 12032 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.001279312s
[INFO] 10.244.210.235:44595 - 65094 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.000163001s
[INFO] 10.244.210.235:55945 - 20758 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.000141202s
ubuntu@k8-01:~$ kubectl get services --all-namespaces
NAMESPACE     NAME                                               TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                      AGE
default       oct-2020-api                                       ClusterIP      10.107.89.213    <none>          80/TCP                       37d
default       oct-2020-nginx-ingress-controller                  LoadBalancer   10.110.235.175   192.168.2.150   80:30194/TCP,443:31514/TCP   37d
default       oct-2020-nginx-ingress-default-backend             ClusterIP      10.98.147.246    <none>          80/TCP                       37d
default       oct-2020-redis-headless                            ClusterIP      None             <none>          6379/TCP                     37d
default       oct-2020-redis-master                              ClusterIP      10.109.58.236    <none>          6379/TCP                     37d
default       oct-2020-webclient                                 ClusterIP      10.111.204.251   <none>          80/TCP                       37d
default       kubernetes                                         ClusterIP      10.96.0.1        <none>          443/TCP                      37d
kube-system   coredns                                            NodePort       10.101.104.114   <none>          53:31245/UDP                 15h
kube-system   kube-dns                                           ClusterIP      10.96.0.10       <none>          53/UDP,53/TCP,9153/TCP       37d
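A related check from the same debugging guide: confirm the kube-dns service actually has endpoints behind it, since a service with no endpoints would time out exactly like this:

kubectl get endpoints kube-dns --namespace=kube-system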

Interestingly, the CoreDNS logs above show the oct-2020-redis-master queries being answered with NOERROR, so CoreDNS itself appears to resolve the name fine; the failure seems to be on the path from the pod to the resolver. When I exec into the application pod:

/app # grep "nameserver" /etc/resolv.conf
nameserver 10.96.0.10
/app # nslookup
BusyBox v1.31.1 () multi-call binary.

Usage: nslookup [-type=QUERY_TYPE] [-debug] HOST [DNS_SERVER]

Query DNS about HOST

QUERY_TYPE: soa,ns,a,aaaa,cname,mx,txt,ptr,any
/app # ping 10.96.0.10
PING 10.96.0.10 (10.96.0.10): 56 data bytes
^C
--- 10.96.0.10 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
/app # nslookup oct-20-redis-master
;; connection timed out; no servers could be reached
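Since a ClusterIP generally does not answer ICMP (kube-proxy only forwards the ports a service declares), the failed ping above is not conclusive by itself; querying port 53 directly is a better test. A sketch, using the explicit DNS_SERVER argument that the BusyBox nslookup usage above shows, where <coredns-pod-ip> is a placeholder for an address reported by kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide:

# Query the service VIP directly:
nslookup kubernetes.default.svc.cluster.local 10.96.0.10
# Bypass the service and query a CoreDNS pod directly:
nslookup kubernetes.default.svc.cluster.local <coredns-pod-ip>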

Any ideas on troubleshooting would be greatly appreciated.

1 Answer

To answer my own question: I deleted the CoreDNS pod(s) and DNS started working again. The command, for one of them, was the following:

kubectl delete pod coredns-f9fd979d6-sw2qp --namespace=kube-system
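Both replicas can also be bounced in one go by selecting on the same label used earlier, or, assuming the standard kubeadm Deployment name coredns, with a rollout restart:

kubectl delete pods --namespace=kube-system -l k8s-app=kube-dns
# or
kubectl rollout restart --namespace=kube-system deployment/coredns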

This doesn't get to the underlying problem of why it is happening, or why Kubernetes isn't detecting that something is wrong with those pods and restarting them. I am going to keep digging and add some more instrumentation to the DNS pods to find out what is actually causing the problem.

If anyone has ideas on specific instrumentation to hook up or metrics to look at, that would be appreciated.
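One low-effort starting point, sketched under the assumption of a standard kubeadm-style setup (CoreDNS ConfigMap named coredns in kube-system): enable the log plugin so every query and failure lands in the pod logs, and scrape the Prometheus metrics already exposed on port 9153 (visible in the kube-dns service listing above). Note that the health plugin the liveness probe hits only reports process liveness, not whether queries are actually being answered, which may be why Kubernetes kept the pods Running.

kubectl -n kube-system edit configmap coredns
# then add "log" inside the .:53 block of the Corefile, e.g.:
#   .:53 {
#       errors
#       log
#       health {
#          lameduck 5s
#       }
#       ...
#   }
# CoreDNS picks the change up via the reload plugin; the "[INFO] Reloading"
# lines in the logs above show that plugin is already enabled.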
