Node not ready, pods pending

Tags:

google-cloud-platform

kubernetes

google-kubernetes-engine

I am running a cluster on GKE and sometimes it gets into a hanging state. Right now I am working with just two nodes and have allowed the cluster to autoscale. One of the nodes has a NotReady status and simply stays in it. As a result, half of my pods are Pending due to insufficient CPU.

How I got there

I deployed a pod which has quite high CPU usage from the moment it starts. When I scaled it to 2, I noticed CPU usage was at 1.0; the moment I scaled the Deployment to 3 replicas, I expected to have the third one in Pending state until the cluster adds another node, then schedule it there.
What happened instead is that the node switched to NotReady status and all pods that were on it are now Pending. However, the node does not restart or anything - it is just not used by Kubernetes. GKE then thinks that there are enough resources, as the VM has 0 CPU usage, and won't scale up to 3. I cannot manually SSH into the instance from the console - it is stuck in the loading loop.

I can manually delete the instance and then it starts working - but I don't think that's the idea of fully managed.

One thing I noticed - not sure if related: in GCE console, when I look at VM instances, the Ready node is being used by the instance group and the load balancer (which is the service around an nginx entry point), but the NotReady node is only in use by the instance group - not the load balancer.

Furthermore, in kubectl get events, there was a line:

Warning   CreatingLoadBalancerFailed   {service-controller }          Error creating load balancer (will retry): Failed to create load balancer for service default/proxy-service: failed to ensure static IP 104.199.xx.xx: error creating gce static IP address: googleapi: Error 400: Invalid value for field 'resource.address': '104.199.xx.xx'. Specified IP address is already reserved., invalid

I specified loadBalancerIP: 104.199.xx.xx in the definition of the proxy-service to make sure that on each restart the service gets the same (reserved) static IP.

Any ideas on how to prevent this from happening? So that if a node gets stuck in NotReady state it at least restarts - but ideally doesn't get into such state to begin with?

Thanks.

asked Nov 17 '16 13:11

Robert Lacok

1 Answer

The first thing I would do is define resource requests and limits for those pods.

Requests tell the cluster how much memory and CPU you expect the pod to use. The scheduler uses them to find a node with enough free capacity to run the pod.

Limits are crucial here: they stop your pods from damaging the stability of the nodes. It's better to have a pod OOM-killed than to have a pod bring a node down through resource starvation.

For example, in this case you're saying that you want 200m CPU (20% of one core) for your pod, and that it must never use more than 300m (30%): CPU usage above the limit is throttled rather than killed. Memory works differently: if the container goes above its 200Mi memory limit, it is OOM-killed and restarted.

spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
    resources:
      limits:
        cpu: 300m
        memory: 200Mi
      requests:
        cpu: 200m
        memory: 100Mi

You can read more here: http://kubernetes.io/docs/admin/limitrange/
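
If you don't want to repeat those values in every pod spec, a LimitRange can apply them as namespace defaults. The following is only a minimal sketch: the object name `default-limits` is hypothetical, the `default` namespace is assumed from the `default/proxy-service` event above, and the numbers are simply copied from the example, so adjust them to what your workload really needs.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits      # hypothetical name, pick your own
  namespace: default        # assumption: the namespace your pods run in
spec:
  limits:
  - type: Container
    # Applied as the request when a container specifies none
    defaultRequest:
      cpu: 200m
      memory: 100Mi
    # Applied as the limit when a container specifies none
    default:
      cpu: 300m
      memory: 200Mi
```

These defaults only apply to containers created after the LimitRange exists, so already running pods have to be recreated to pick them up.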

answered Oct 20 '22 10:10

Ivan Pedrazas
