
GKE Kubernetes Autoscaler - max cluster cpu, memory limit reached

The GKE autoscaler is not scaling nodes up past 15 nodes (the former limit).

I've changed the Min and Max node values for the cluster to 17-25.
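
For reference, roughly the equivalent gcloud command (the cluster and node pool names below are placeholders, the zone is taken from the status further down):

gcloud container clusters update my-cluster \
    --zone europe-west4-b \
    --node-pool adjust-scope \
    --enable-autoscaling \
    --min-nodes 17 \
    --max-nodes 25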

However, the node count is stuck at 14-15 and is not going up. Right now my cluster is full and no more pods can fit, so every new deployment should trigger a node scale-up and schedule itself onto the new node, which is not happening.

When I create a deployment, it is stuck in the Pending state with this message:

pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max cluster cpu, memory limit reached

Max cluster cpu, memory limit reached sounds like the maximum node count is somehow still 14-15. How is that possible? Why is it not triggering a node scale-up?

ClusterAutoscalerStatus:

apiVersion: v1
data:
  status: |+
    Cluster-autoscaler status at 2020-03-10 10:35:39.899329642 +0000 UTC:
    Cluster-wide:
      Health:      Healthy (ready=14 unready=0 notStarted=0 longNotStarted=0 registered=14 longUnregistered=0)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:49:11.965623459 +0000 UTC m=+4133.007827509
      ScaleUp:     NoActivity (ready=14 registered=14)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 08:40:47.775200087 +0000 UTC m=+28.817404126
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:49:49.580623718 +0000 UTC m=+4170.622827779

    NodeGroups:
      Name:        https://content.googleapis.com/compute/v1/projects/project/zones/europe-west4-b/instanceGroups/adjust-scope-bff43e09-grp
      Health:      Healthy (ready=14 unready=0 notStarted=0 longNotStarted=0 registered=14 longUnregistered=0 cloudProviderTarget=14 (minSize=17, maxSize=25))
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:46:19.45614781 +0000 UTC m=+3960.498351857
      ScaleUp:     NoActivity (ready=14 cloudProviderTarget=14)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:46:19.45614781 +0000 UTC m=+3960.498351857
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:49:49.580623718 +0000 UTC m=+4170.622827779
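
(The status above comes from the cluster-autoscaler-status ConfigMap in the kube-system namespace; something like this should print it:)

kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml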

The deployment is very small (200m CPU, 256Mi memory), so it would surely fit if a new node were added.
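
For context, the resource requests are just a minimal sketch like the following (the deployment name and image are hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: nginx:latest    # placeholder image
        resources:
          requests:
            cpu: 200m          # the 200m CPU request mentioned above
            memory: 256Mi      # the 256Mi memory request mentioned above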

This looks like a bug in the node pool/autoscaler: 15 was my former node count limit, and somehow the autoscaler still seems to think 15 is the maximum.

EDIT: I created a new node pool with bigger machines and GKE autoscaling turned on. After some time the same issue appeared, even though the nodes have free resources. kubectl top nodes output:

NAME                                                  CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
gke-infrastructure-n-autoscaled-node--0816b9c6-fm5v   805m         41%    4966Mi          88%       
gke-infrastructure-n-autoscaled-node--0816b9c6-h98f   407m         21%    2746Mi          48%       
gke-infrastructure-n-autoscaled-node--0816b9c6-hr0l   721m         37%    3832Mi          67%       
gke-infrastructure-n-autoscaled-node--0816b9c6-prfw   1020m        52%    5102Mi          90%       
gke-infrastructure-n-autoscaled-node--0816b9c6-s94x   946m         49%    3637Mi          64%       
gke-infrastructure-n-autoscaled-node--0816b9c6-sz5l   2000m        103%   5738Mi          101%      
gke-infrastructure-n-autoscaled-node--0816b9c6-z6dv   664m         34%    4271Mi          75%       
gke-infrastructure-n-autoscaled-node--0816b9c6-zvbr   970m         50%    3061Mi          54%

And yet I still get the message 1 max cluster cpu, memory limit reached. This also happens when updating a deployment: the new version sometimes gets stuck in Pending because it won't trigger the scale-up.

EDIT 2: While describing the cluster with the gcloud command (shown after the output below), I found this:

autoscaling:
  autoprovisioningNodePoolDefaults:
    oauthScopes:
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    serviceAccount: default
  enableNodeAutoprovisioning: true
  resourceLimits:
  - maximum: '5'
    minimum: '1'
    resourceType: cpu
  - maximum: '5'
    minimum: '1'
    resourceType: memory
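
(The describe command was along these lines; the cluster name is a placeholder:)

gcloud container clusters describe my-cluster \
    --zone europe-west4-b \
    --format "yaml(autoscaling)"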

How does this interact with autoscaling turned on? Does it refuse to trigger a scale-up once those limits are reached? (The cluster's total is already above them.)

Asked Mar 10 '20 by Josef Korbel


1 Answer

I ran into the same issue and was bashing my head against the wall trying to figure out what was going on. Even support couldn't figure it out.

The issue is that when you enable node auto-provisioning at the cluster level, the min/max CPU and memory you set are the limits for the entire cluster. At first glance the UI seems to be asking for the min/max CPU and memory you would want per auto-provisioned node, but that is not correct. So if, for example, you wanted a maximum of 100 nodes with 8 CPUs per node, your max CPU should be 800. A cluster-wide maximum is obviously useful so things don't get out of control, but the way it is presented is not intuitive. Since you don't actually have control over which machine type gets picked, wouldn't it be useful to stop Kubernetes from picking a 100-core machine for a 1-core task? That is what I thought the setting was asking for when I configured it.
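
If you do want cluster-level auto-provisioning, the limits can be raised to cluster-wide totals; a sketch using the 100-node, 8-CPU example above (the cluster name, zone, and memory total are assumptions):

# max-cpu/max-memory are cluster-wide totals in cores and GB,
# e.g. 100 nodes x 8 CPUs = 800 cores and (assumed) 100 nodes x 32 GB = 3200 GB
gcloud container clusters update my-cluster \
    --zone europe-west4-b \
    --enable-autoprovisioning \
    --min-cpu 1 \
    --max-cpu 800 \
    --min-memory 1 \
    --max-memory 3200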

Node auto-provisioning is useful because, if for some reason your own node pool with autoscaling enabled can't meet your demands (due to quota issues, for instance), the cluster-level node auto-provisioner will figure out a different node pool machine type that it can provision to meet them. In my scenario I was using C2 CPUs, and there was a scarcity of those CPUs in the region, so my node pool stopped auto-scaling.

To make things even more confusing, most people start by specifying their node pool machine type, so they are already used to customizing these limits on a per-node basis. But then something stops working, like a quota issue you have no idea about, so you get desperate and configure the node auto-provisioner at the cluster level, and then get totally screwed because you thought you were specifying the limits for the new potential machine type.

Hopefully this helps clear some things up.

Answered Oct 02 '22 by Sean Montgomery