
Node not ready, pods pending

I am running a cluster on GKE and sometimes it gets into a hanging state. Right now I am working with just two nodes and have allowed the cluster to autoscale. One of the nodes has a NotReady status and simply stays in it. As a result, half of my pods are Pending due to insufficient CPU.

How I got there

I deployed a pod that has quite high CPU usage from the moment it starts. When I scaled it to 2 replicas, I noticed CPU usage was at 1.0. When I scaled the Deployment to 3 replicas, I expected the third one to stay Pending until the cluster added another node, and then be scheduled there.
What happened instead is that the node switched to NotReady status and all pods that were on it are now Pending. However, the node does not restart or anything - Kubernetes simply stops using it. GKE then thinks there are enough resources, because the VM has 0 CPU usage, and won't scale up to 3. I cannot manually SSH into the instance from the console either - it is stuck in the loading loop.

I can manually delete the instance and then it starts working - but I don't think that's the idea of a fully managed service.

One thing I noticed - not sure if related: in GCE console, when I look at VM instances, the Ready node is being used by the instance group and the load balancer (which is the service around an nginx entry point), but the NotReady node is only in use by the instance group - not the load balancer.

Furthermore, in kubectl get events, there was a line:

Warning   CreatingLoadBalancerFailed   {service-controller }          Error creating load balancer (will retry): Failed to create load balancer for service default/proxy-service: failed to ensure static IP 104.199.xx.xx: error creating gce static IP address: googleapi: Error 400: Invalid value for field 'resource.address': '104.199.xx.xx'. Specified IP address is already reserved., invalid

I specified loadBalancerIP: 104.199.xx.xx in the definition of the proxy-service to make sure that on each restart the service gets the same (reserved) static IP.
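
For reference, a Service definition of the kind described above might look roughly like the sketch below. Only the Service name (proxy-service), the default namespace, and the loadBalancerIP field come from the question; the selector label and the port numbers are assumptions made for illustration, and the IP is left masked as in the original.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: proxy-service
  namespace: default
spec:
  type: LoadBalancer
  # Reserved static IP, masked exactly as in the question.
  loadBalancerIP: 104.199.xx.xx
  selector:
    app: nginx-proxy   # hypothetical label; must match the nginx entry-point pods
  ports:
  - port: 80           # assumed port exposed by the load balancer
    targetPort: 80     # assumed container port of the nginx pods
```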

Any ideas on how to prevent this from happening? So that if a node gets stuck in NotReady state it at least restarts - but ideally doesn't get into such state to begin with?

Thanks.

Robert Lacok asked Nov 17 '16 13:11





1 Answer

The first thing I would do is define resource requests and limits for those pods.

Requests tell the cluster how much memory and CPU you expect the pod to use. The scheduler uses them to find a node with enough free capacity to run the pod.

Limits are crucial here: they keep your pods from damaging the stability of the nodes. It's better to have a container OOM-killed than to have a pod bring a node down through resource starvation.

For example, the spec below requests 200m CPU (20% of one core) and 100Mi of memory for the container, and caps it at 300m CPU (30% of one core) and 200Mi of memory. If the container tries to use more CPU than its limit, it is throttled rather than killed; if it exceeds its memory limit, it is OOM-killed and, depending on the pod's restart policy, restarted.

spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
    resources:
      limits:
        cpu: 300m
        memory: 200Mi
      requests:
        cpu: 200m
        memory: 100Mi

You can read more here: http://kubernetes.io/docs/admin/limitrange/
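
As a sketch of what the linked LimitRange documentation covers: a LimitRange object can apply default requests and limits to every container in a namespace that does not declare its own. The object name below is hypothetical, and the values simply mirror the pod spec above; adjust both to your workload.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-cpu-mem-limits   # hypothetical name
  namespace: default
spec:
  limits:
  - type: Container
    # Applied as the request when a container specifies none.
    defaultRequest:
      cpu: 200m
      memory: 100Mi
    # Applied as the limit when a container specifies none.
    default:
      cpu: 300m
      memory: 200Mi
```

With something like this in place, pods created in the namespace without an explicit resources section still get requests the scheduler can account for.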

Ivan Pedrazas answered Oct 20 '22 10:10
