How to fix Openshift pod start failed with NodeUnderDiskPressure?

I start OpenShift locally using

oc cluster up

Then I create a pod using hello-pod.json with this command:

oc create -f examples/hello-openshift/hello-pod.json

The pod is created but it can't start. OpenShift shows an error:

Reason: Failed Scheduling

Message:  0/1 nodes are available: 1 NodeUnderDiskPressure.

I still have plenty of free space on my hard drive. I don't know where else to look for logs. How do I fix the problem?

asked by Sean Nguyen, Jan 16 '18




2 Answers

In my case, an adjustment of node-config.yaml fixed the issue:

1) Find the generated node-config.yaml file, e.g. under /var/lib/origin/ or in your custom config path.

2) Open it in an editor, locate the kubeletArguments section, and add your desired disk eviction policy:

kubeletArguments:
  # hard eviction thresholds: the node only reports pressure and evicts pods
  # once free resources drop below these values
  eviction-hard:
  - memory.available<100Mi
  - nodefs.available<1%
  - nodefs.inodesFree<1%
  - imagefs.available<1%

A detailed description can be found here: OpenShift Documentation - Default Hard Eviction Thresholds
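
After applying the change and restarting the cluster, you can verify that the DiskPressure condition has cleared. A minimal check, assuming the single local node created by oc cluster up is named localhost (as shown in the answer below):

$ oc login -u system:admin
$ oc describe node localhost | grep -i diskpressure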

answered by toschneck


Basically I just had to reset the Docker filesystem and the Kubernetes configuration in my home directory.

$ oc cluster down
$ sudo systemctl stop docker
$ sudo rm -rf /var/lib/docker   # removes all local images, containers, and volumes
$ rm -rf ~/.kube                # removes the local kubeconfig/cluster credentials
$ sudo systemctl start docker
$ oc cluster up

DONE! -- I was able to create pods after this.

Here are some other things I tried while tracking down the same NodeUnderDiskPressure issue; they might help you if the above doesn't solve the problem:

First I listed the available nodes with kubectl:

$ oc login -u system:admin
$ kubectl get nodes
NAME        STATUS    AGE       VERSION
localhost   Ready     12h       v1.7.6+a08f5eeb62

Next I retrieved the description for the localhost node:

$ kubectl describe node localhost
Name:           localhost
Role:           
Labels:         beta.kubernetes.io/arch=amd64
            beta.kubernetes.io/os=linux
            kubernetes.io/hostname=localhost
Annotations:        volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:         <none>
CreationTimestamp:  Mon, 05 Mar 2018 20:00:20 -0600
Conditions:
  Type          Status  LastHeartbeatTime           LastTransitionTime          Reason              Message
  ----          ------  -----------------           ------------------          ------              -------
  OutOfDisk         False   Tue, 06 Mar 2018 08:09:03 -0600     Mon, 05 Mar 2018 20:00:20 -0600     KubeletHasSufficientDisk    kubelet has sufficient disk space available
  MemoryPressure    False   Tue, 06 Mar 2018 08:09:03 -0600     Mon, 05 Mar 2018 20:00:20 -0600     KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure      True    Tue, 06 Mar 2018 08:09:03 -0600     Mon, 05 Mar 2018 20:00:31 -0600     KubeletHasDiskPressure      kubelet has disk pressure
  Ready         True    Tue, 06 Mar 2018 08:09:03 -0600     Mon, 05 Mar 2018 20:00:31 -0600     KubeletReady            kubelet is posting ready status
Addresses:
  InternalIP:   192.168.0.14
  Hostname: localhost
Capacity:
 cpu:       4
 memory:    16311024Ki
 pods:      40
Allocatable:
 cpu:       4
 memory:    16208624Ki
 pods:      40
System Info:
 Machine ID:            6895f77789824d26acef6d0db236319f
 System UUID:           248A664C-33F8-11B2-A85C-FC31558EDC86
 Boot ID:           1a5cc22b-81f1-4b07-b26f-917a7d17936f
 Kernel Version:        4.13.16-100.fc25.x86_64
 OS Image:          CentOS Linux 7 (Core)
 Operating System:      linux
 Architecture:          amd64
 Container Runtime Version: docker://1.12.6
 Kubelet Version:       v1.7.6+a08f5eeb62
 Kube-Proxy Version:        v1.7.6+a08f5eeb62
ExternalID:         localhost
Non-terminated Pods:        (0 in total)
  Namespace         Name        CPU Requests    CPU Limits  Memory Requests Memory Limits
  ---------         ----        ------------    ----------  --------------- -------------
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests Memory Limits
  ------------  ----------  --------------- -------------
  0 (0%)    0 (0%)      0 (0%)      0 (0%)
Events:
  FirstSeen LastSeen    Count   From            SubObjectPath   Type        Reason          Message
  --------- --------    -----   ----            -------------   --------    ------          -------
  12h       8m      2877    kubelet, localhost          Warning     EvictionThresholdMet    Attempting to reclaim imagefs
  11h       3m      136 kubelet, localhost          Warning     ImageGCFailed       (combined from similar events): wanted to free 3113113190 bytes, but freed 0 bytes space with errors in image deletion: [rpc error: code = 2 desc = Error response from daemon: {"message":"conflict: unable to delete 933861786d39 (must be forced) - image is being used by stopped container 82eca7ad6fd6"}, rpc error: code = 2 desc = Error response from daemon: {"message":"conflict: unable to delete bcccfe5352d3 (must be forced) - image is being used by stopped container 9c4ad3dc4b80"}, rpc error: code = 2 desc = Error response from daemon: {"message":"conflict: unable to delete b7b0dbc4f785 (must be forced) - image is being used by stopped container d388fa17ff84"}, rpc error: code = 2 desc = Error response from daemon: {"message":"conflict: unable to delete 0129e5e73319 (cannot be forced) - image has dependent child images"}, rpc error: code = 2 desc = Error response from daemon: {"message":"conflict: unable to delete 725dcfab7d63 (must be forced) - image is being used by stopped container 9eb3a771aa6f"}, rpc error: code = 2 desc = Error response from daemon: {"message":"conflict: unable to delete 8ec432b4cda3 (cannot be forced) - image is being used by running container a3fe6da22775"}]

There are a few things to note:

  1. The DiskPressure condition status is True.
  2. There are Warning events: first the EvictionThresholdMet event, attempting to reclaim imagefs, and then the ImageGCFailed event with details about the images that cannot be removed (a quick way to query both is shown right after this list).
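
For example, to query just those two pieces directly instead of reading the full node description (assuming the node is named localhost):

$ kubectl get node localhost -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
$ kubectl get events --all-namespaces | grep -E 'EvictionThresholdMet|ImageGCFailed'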

Here is a formatted version of the ImageGCFailed message in my case:

(combined from similar events):wanted to free 3113113190 bytes,
but freed 0 bytes space with errors in image deletion:[  
   rpc error:   code = 2 desc = Error response from daemon:{  
      "message":"conflict: unable to delete 933861786d39 (must be forced) - image is being used by stopped container 82eca7ad6fd6"
   },
   rpc error:   code = 2 desc = Error response from daemon:{  
      "message":"conflict: unable to delete bcccfe5352d3 (must be forced) - image is being used by stopped container 9c4ad3dc4b80"
   },
   rpc error:   code = 2 desc = Error response from daemon:{  
      "message":"conflict: unable to delete b7b0dbc4f785 (must be forced) - image is being used by stopped container d388fa17ff84"
   },
   rpc error:   code = 2 desc = Error response from daemon:{  
      "message":"conflict: unable to delete 0129e5e73319 (cannot be forced) - image has dependent child images"
   },
   rpc error:   code = 2 desc = Error response from daemon:{  
      "message":"conflict: unable to delete 725dcfab7d63 (must be forced) - image is being used by stopped container 9eb3a771aa6f"
   },
   rpc error:   code = 2 desc = Error response from daemon:{  
      "message":"conflict: unable to delete 8ec432b4cda3 (cannot be forced) - image is being used by running container a3fe6da22775"
   }
]

Based on this information (https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#reclaiming-node-level-resources), I now inspect the existing containers and try to remove them manually:

$ docker ps -a
CONTAINER ID        IMAGE                                         COMMAND                  CREATED             STATUS                      PORTS               NAMES
a3fe6da22775        openshift/origin:v3.7.1                       "/usr/bin/openshift s"   12 hours ago        Up 12 hours                                     origin
82eca7ad6fd6        dtf-bpms/nodejs-mongo-persistent-2:4e90f728   "/bin/sh -ic 'npm sta"   3 months ago        Exited (137) 3 months ago                       openshift_s2i-build_nodejs-mongo-persistent-2_dtf-bpms_post-commit_fe89fcfd
9c4ad3dc4b80        dtf-bpms/nodejs-mongo-persistent-2:4e23c7d5   "/bin/sh -ic 'npm tes"   3 months ago        Exited (137) 3 months ago                       openshift_s2i-build_nodejs-mongo-persistent-2_dtf-bpms_post-commit_de141bcd
d388fa17ff84        dtf-bpms/nodejs-mongo-persistent-1:439d35ea   "/bin/sh -ic 'npm tes"   3 months ago        Exited (137) 3 months ago                       openshift_s2i-build_nodejs-mongo-persistent-1_dtf-bpms_post-commit_277b19ca
9eb3a771aa6f        hello-world                                   "/hello"                 3 months ago        Exited (0) 3 months ago                         serene_babbage

Now I manually delete all stopped containers:

$ docker rm $(docker ps -a -q)
82eca7ad6fd6
9c4ad3dc4b80
d388fa17ff84
9eb3a771aa6f
Error response from daemon: You cannot remove a running container a3fe6da22775a559fe94ab0eb5f52d55d9aca6d1f950f107d13243fa029e071f. Stop the container before attempting removal or use -f

In this case it is fine to keep the OpenShift origin container running.

$ docker ps -a
CONTAINER ID        IMAGE                     COMMAND                  CREATED             STATUS              PORTS               NAMES
a3fe6da22775        openshift/origin:v3.7.1   "/usr/bin/openshift s"   12 hours ago        Up 12 hours                             origin
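
With the stopped containers gone, one could in principle also remove the images that were blocking garbage collection so the kubelet can reclaim imagefs space. A rough sketch using the image IDs from the ImageGCFailed message above (note that docker image prune only exists in Docker 1.13+, and this node runs 1.12.6, hence plain docker rmi):

$ docker rmi 933861786d39 bcccfe5352d3 b7b0dbc4f785 725dcfab7d63
$ docker rmi $(docker images -f dangling=true -q)   # remove any remaining dangling images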

Next I restart OpenShift and Docker, try to create my containers again, and describe the localhost node:

$ oc cluster down
$ sudo systemctl restart docker
$ oc cluster up
... (wait for cluster up start)
$ [CREATE PROJECT AND CONTAINERS]
$ oc login -u system:admin
$ kubectl describe node localhost
... (node description and header information)
Events:
  FirstSeen LastSeen    Count   From            SubObjectPath   Type        Reason          Message
  --------- --------    -----   ----            -------------   --------    ------          -------
  1h        1h      2   kubelet, localhost          Normal      NodeHasSufficientMemory Node localhost status is now: NodeHasSufficientMemory
  1h        1h      2   kubelet, localhost          Normal      NodeHasNoDiskPressure   Node localhost status is now: NodeHasNoDiskPressure
  1h        1h      1   kubelet, localhost          Normal      NodeAllocatableEnforced Updated Node Allocatable limit across pods
  1h        1h      2   kubelet, localhost          Normal      NodeHasSufficientDisk   Node localhost status is now: NodeHasSufficientDisk
  1h        1h      1   kubelet, localhost          Normal      NodeReady       Node localhost status is now: NodeReady
  1h        1h      1   kubelet, localhost          Normal      NodeHasDiskPressure Node localhost status is now: NodeHasDiskPressure
  1h        1h      1   kubelet, localhost          Warning     ImageGCFailed       wanted to free 2934625894 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = 2 desc = Error response from daemon: {"message":"conflict: unable to delete 8ec432b4cda3 (cannot be forced) - image is being used by running container 4bcd2196747c"}

You can see that NodeHasDiskPressure still appears in the events even after the old unused containers and images have been cleaned up. This is the point where the next step was to delete the old, dirty Docker filesystem and start with a fresh one, which is the fix shown at the top of this answer.

answered by Diego Torres Fuerte