I am facing a weird issue with my pods. I am launching around 20 pods in my env and every time some random 3-4 pods out of them hang with Init:0/1 status. On checking the status of pod, Init container shows running status, which should terminate after task is finished, and app container shows Waiting/Pod Initializing stage. Same init container image and specs are being used in across all 20 pods but this issue is happening with some random pods every time. And on terminating these stuck pods, it stucks in Terminating state. If i ssh on node at which this pod is launched and run docker ps, it shows me init container in running state but on running docker exec it throws error that container doesn't exist. This init container is pulling configs from Consul Server and on checking volume (got from docker inspect), i found that it has pulled all the key-val pairs correctly and saved it in defined file name. I have checked resources on all the nodes and more than enough is available on all.
Below is detailed example of on the pod acting like this.
Kubectl Version :
kubectl version 
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.0", GitCommit:"925c127ec6b946659ad0fd596fa959be43f0cc05", GitTreeState:"clean", BuildDate:"2017-12-15T21:07:38Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"} 
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T09:42:01Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"} 
Pods :
kubectl get pods -n dev1|grep -i session-service 
session-service-app-75c9c8b5d9-dsmhp               0/1       Init:0/1           0          10h 
session-service-app-75c9c8b5d9-vq98k               0/1       Terminating        0          11h 
Pods Status :
kubectl describe pods session-service-app-75c9c8b5d9-dsmhp -n dev1 
Name:           session-service-app-75c9c8b5d9-dsmhp 
Namespace:      dev1 
Node:           ip-192-168-44-18.ap-southeast-1.compute.internal/192.168.44.18 
Start Time:     Fri, 27 Apr 2018 18:14:43 +0530 
Labels:         app=session-service-app 
                pod-template-hash=3175746185 
                release=session-service-app 
Status:         Pending 
IP:             100.96.4.240 
Controlled By:  ReplicaSet/session-service-app-75c9c8b5d9 
Init Containers: 
  initpullconsulconfig: 
    Container ID:  docker://c658d59995636e39c9d03b06e4973b6e32f818783a21ad292a2cf20d0e43bb02 
    Image:         shr-u-nexus-01.myops.de:8082/utils/app-init:1.0 
    Image ID:      docker-pullable://shr-u-nexus-01.myops.de:8082/utils/app-init@sha256:7b0692e3f2e96c6e54c2da614773bb860305b79922b79642642c4e76bd5312cd 
    Port:          <none> 
    Args: 
      -consul-addr=consul-server.consul.svc.cluster.local:8500 
    State:          Running 
      Started:      Fri, 27 Apr 2018 18:14:44 +0530 
    Ready:          False 
    Restart Count:  0 
    Environment: 
      CONSUL_TEMPLATE_VERSION:  0.19.4 
      POD:                      sand 
      SERVICE:                  session-service-app 
      ENV:                      dev1 
    Mounts: 
      /var/lib/app from shared-volume-sidecar (rw) 
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bthkv (ro) 
Containers: 
  session-service-app: 
    Container ID: 
    Image:          shr-u-nexus-01.myops.de:8082/sand-images/sessionservice-init:sitv12 
    Image ID: 
    Port:           8080/TCP 
    State:          Waiting 
      Reason:       PodInitializing 
    Ready:          False 
    Restart Count:  0 
    Environment:    <none> 
    Mounts: 
      /etc/appenv from shared-volume-sidecar (rw) 
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bthkv (ro) 
Conditions: 
  Type           Status 
  Initialized    False 
  Ready          False 
  PodScheduled   True 
Volumes: 
  shared-volume-sidecar: 
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime) 
    Medium: 
  default-token-bthkv: 
    Type:        Secret (a volume populated by a Secret) 
    SecretName:  default-token-bthkv 
    Optional:    false 
QoS Class:       BestEffort 
Node-Selectors:  <none> 
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s 
                 node.kubernetes.io/unreachable:NoExecute for 300s 
Events:          <none> 
Container Status on Node :
sudo docker ps|grep -i session 
c658d5999563        shr-u-nexus-01.myops.de:8082/utils/app-init@sha256:7b0692e3f2e96c6e54c2da614773bb860305b79922b79642642c4e76bd5312cd                                       "/usr/bin/consul-t..."   10 hours ago        Up 10 hours                             k8s_initpullconsulconfig_session-service-app-75c9c8b5d9-dsmhp_dev1_c2075f2a-4a18-11e8-88e7-02929cc89ab6_0 
da120abd3dbb        gcr.io/google_containers/pause-amd64:3.0                                                                                                                      "/pause"                 10 hours ago        Up 10 hours                             k8s_POD_session-service-app-75c9c8b5d9-dsmhp_dev1_c2075f2a-4a18-11e8-88e7-02929cc89ab6_0 
f53d48c7d6ec        shr-u-nexus-01.myops.de:8082/utils/app-init@sha256:7b0692e3f2e96c6e54c2da614773bb860305b79922b79642642c4e76bd5312cd                                       "/usr/bin/consul-t..."   10 hours ago        Up 10 hours                             k8s_initpullconsulconfig_session-service-app-75c9c8b5d9-vq98k_dev1_42837d12-4a12-11e8-88e7-02929cc89ab6_0 
c26415458d39        gcr.io/google_containers/pause-amd64:3.0                                                                                                                      "/pause"                 10 hours ago        Up 10 hours                             k8s_POD_session-service-app-75c9c8b5d9-vq98k_dev1_42837d12-4a12-11e8-88e7-02929cc89ab6_0 
On running Docker exec (same result with kubectl exec) :
sudo docker exec -it c658d5999563 bash 
rpc error: code = 2 desc = containerd: container not found 
What Does CrashLoopBackOff mean? CrashLoopBackOff is a status message that indicates one of your pods is in a constant state of flux—one or more containers are failing and restarting repeatedly. This typically happens because each pod inherits a default restartPolicy of Always upon creation.
If a Pod is Running but not Ready it means that the Readiness probe is failing. When the Readiness probe is failing, the Pod isn't attached to the Service, and no traffic is forwarded to that instance.
A Pod can be stuck in Init status due to many reasons.
PodInitializing or Init Status means that the Pod contains an Init container that hasn't finalized (Init containers: specialized containers that run before app containers in a Pod, init containers can contain utilities or setup scripts). If the pods status is ´Init:0/1´ means one init container is not finalized; init:N/M means the Pod has M Init Containers, and N have completed so far.

For those scenario the best would be to gather information, as the root cause can be different in every PodInitializing issue.
kubectl describe pods pod-XXX with this command you can get many info of the pod, you can check if there's any meaningful event as well. Save the init container name
kubectl logs pod-XXX this command prints the logs for a container in a pod or specified resource.
kubectl logs pod-XXX -c init-container-xxx This is the most accurate as could print the logs of the init container. You can get the init container name describing the pod in order to replace "init-container-XXX" as for example to "copy-default-config" as below:

The output of kubectl logs pod-XXX -c init-container-xxx can thrown meaningful info of the issue, reference:

In the image above we can see that the root cause is that the init container can't download the plugins from jenkins (timeout), here now we can check connection config, proxy, dns; or just modify the yaml to deploy the container without the plugins.
Additional:
kubectl describe node node-XXX describing the pod will give you the name of its node, which you can also inspect with this command.
kubectl get events to list the cluster events.
journalctl -xeu kubelet | tail -n 10 kubelet logs on systemd (journalctl -xeu docker | tail -n 1 for docker).
Solutions
The solutions depends on the information gathered, once the root cause is found.
When you find a log with an insight of the root cause, you can investigate that specific root cause.
Some examples:
1 > In there this happened when init container was deleted, can be fixed deleting the pod so it would be recreated, or redeploy it. Same scenario in 1.1.
2 > If you found "bad address 'kube-dns.kube-system'" the PVC may not be recycled correctly, solution provided in 2 is running /opt/kubernetes/bin/kube-restart.sh.
3 > There, a sh file was not found, the solution would be to modify the yaml file or remove the container if unnecessary.
4 > A FailedSync was found, and it was solved restarting docker on the node.
In general you can modify the yaml, for example to avoid using an outdated URL, try to recreate the affected resource, or just remove the init container that causes the issue from your deployment. However the specific solution will depend on the specific root cause.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With