
GKE Kubernetes Autoscaler - max cluster cpu, memory limit reached

The GKE autoscaler is not scaling nodes up past 15 nodes (the former limit).

I've changed the Min and Max node values for the cluster to 17-25.
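
For reference, roughly the equivalent gcloud command (the cluster and node pool names below are placeholders, the zone is taken from the status further down):

gcloud container clusters update my-cluster \
    --zone europe-west4-b \
    --node-pool adjust-scope \
    --enable-autoscaling \
    --min-nodes 17 \
    --max-nodes 25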

However, the node count is stuck at 14-15 and is not going up. Right now my cluster is full and no more pods can fit, so every new deployment should trigger a node scale-up and schedule itself onto the new node, which is not happening.

When I create a deployment, it is stuck in the Pending state with this message:

pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max cluster cpu, memory limit reached

Max cluster cpu, memory limit reached sounds like the maximum node count is somehow still 14-15. How is that possible? Why is it not triggering a node scale-up?

ClusterAutoscalerStatus:

apiVersion: v1
data:
  status: |+
    Cluster-autoscaler status at 2020-03-10 10:35:39.899329642 +0000 UTC:
    Cluster-wide:
      Health:      Healthy (ready=14 unready=0 notStarted=0 longNotStarted=0 registered=14 longUnregistered=0)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:49:11.965623459 +0000 UTC m=+4133.007827509
      ScaleUp:     NoActivity (ready=14 registered=14)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 08:40:47.775200087 +0000 UTC m=+28.817404126
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:49:49.580623718 +0000 UTC m=+4170.622827779

    NodeGroups:
      Name:        https://content.googleapis.com/compute/v1/projects/project/zones/europe-west4-b/instanceGroups/adjust-scope-bff43e09-grp
      Health:      Healthy (ready=14 unready=0 notStarted=0 longNotStarted=0 registered=14 longUnregistered=0 cloudProviderTarget=14 (minSize=17, maxSize=25))
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:46:19.45614781 +0000 UTC m=+3960.498351857
      ScaleUp:     NoActivity (ready=14 cloudProviderTarget=14)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:46:19.45614781 +0000 UTC m=+3960.498351857
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:49:49.580623718 +0000 UTC m=+4170.622827779
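
(The status above comes from the cluster-autoscaler-status ConfigMap in the kube-system namespace; something like this should print it:)

kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml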

The deployment is very small (200m CPU, 256Mi memory), so it would surely fit if a new node were added.
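
For context, the resource requests are just a minimal sketch like the following (the deployment name and image are hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: nginx:latest    # placeholder image
        resources:
          requests:
            cpu: 200m          # the 200m CPU request mentioned above
            memory: 256Mi      # the 256Mi memory request mentioned above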

This looks like a bug in the node pool/autoscaler: 15 was my former node count limit, and somehow the autoscaler still seems to think 15 is the maximum.

EDIT: I created a new node pool with bigger machines and GKE autoscaling turned on. After some time the same issue appeared, even though the nodes have free resources. kubectl top nodes output:

NAME                                                  CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
gke-infrastructure-n-autoscaled-node--0816b9c6-fm5v   805m         41%    4966Mi          88%       
gke-infrastructure-n-autoscaled-node--0816b9c6-h98f   407m         21%    2746Mi          48%       
gke-infrastructure-n-autoscaled-node--0816b9c6-hr0l   721m         37%    3832Mi          67%       
gke-infrastructure-n-autoscaled-node--0816b9c6-prfw   1020m        52%    5102Mi          90%       
gke-infrastructure-n-autoscaled-node--0816b9c6-s94x   946m         49%    3637Mi          64%       
gke-infrastructure-n-autoscaled-node--0816b9c6-sz5l   2000m        103%   5738Mi          101%      
gke-infrastructure-n-autoscaled-node--0816b9c6-z6dv   664m         34%    4271Mi          75%       
gke-infrastructure-n-autoscaled-node--0816b9c6-zvbr   970m         50%    3061Mi          54%

And yet I still get the message 1 max cluster cpu, memory limit reached. This also happens when updating a deployment: the new version sometimes gets stuck in Pending because it won't trigger the scale-up.

EDIT 2: While describing the cluster with the gcloud command (shown after the output below), I found this:

autoscaling:
  autoprovisioningNodePoolDefaults:
    oauthScopes:
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    serviceAccount: default
  enableNodeAutoprovisioning: true
  resourceLimits:
  - maximum: '5'
    minimum: '1'
    resourceType: cpu
  - maximum: '5'
    minimum: '1'
    resourceType: memory
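
(The describe command was along these lines; the cluster name is a placeholder:)

gcloud container clusters describe my-cluster \
    --zone europe-west4-b \
    --format "yaml(autoscaling)"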

How does this interact with autoscaling turned on? Does it refuse to trigger a scale-up once those limits are reached? (The cluster's total is already above them.)

Asked Mar 10 '20 by Josef Korbel


1 Answer

I ran into the same issue and was bashing my head against the wall trying to figure out what was going on. Even support couldn't figure it out.

The issue is that when you enable node auto-provisioning at the cluster level, the min/max CPU and memory you set are the limits for the entire cluster. At first glance the UI seems to be asking for the min/max CPU and memory you would want per auto-provisioned node, but that is not correct. So if, for example, you wanted a maximum of 100 nodes with 8 CPUs per node, your max CPU should be 800. A cluster-wide maximum is obviously useful so things don't get out of control, but the way it is presented is not intuitive. Since you don't actually have control over which machine type gets picked, wouldn't it be useful to stop Kubernetes from picking a 100-core machine for a 1-core task? That is what I thought the setting was asking for when I configured it.
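
If you do want cluster-level auto-provisioning, the limits can be raised to cluster-wide totals; a sketch using the 100-node, 8-CPU example above (the cluster name, zone, and memory total are assumptions):

# max-cpu/max-memory are cluster-wide totals in cores and GB,
# e.g. 100 nodes x 8 CPUs = 800 cores and (assumed) 100 nodes x 32 GB = 3200 GB
gcloud container clusters update my-cluster \
    --zone europe-west4-b \
    --enable-autoprovisioning \
    --min-cpu 1 \
    --max-cpu 800 \
    --min-memory 1 \
    --max-memory 3200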

Node auto-provisioning is useful because, if for some reason your own node pool with autoscaling enabled can't meet your demands (due to quota issues, for instance), the cluster-level node auto-provisioner will figure out a different node pool machine type that it can provision to meet them. In my scenario I was using C2 CPUs, and there was a scarcity of those CPUs in the region, so my node pool stopped auto-scaling.

To make things even more confusing, most people start by specifying their node pool machine type, so they are already used to customizing these limits on a per-node basis. But then something stops working, like a quota issue you have no idea about, so you get desperate and configure the node auto-provisioner at the cluster level, and then get totally screwed because you thought you were specifying the limits for the new potential machine type.

Hopefully this helps clear some things up.

Answered Oct 02 '22 by Sean Montgomery