GKE Autoscaler is not scaling nodes up after 15 nodes (former limit)
I've changed the Min and Max values for the cluster's node pool to 17-25. However, the node count is stuck at 14-15 and is not going up. Right now my cluster is full, no more pods can fit in, so every new deployment should trigger a node scale-up and get scheduled onto the new node, which is not happening.
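(For reference, the equivalent change from the CLI would be roughly the following; CLUSTER_NAME and NODE_POOL_NAME are placeholders, the zone is the one from the node group below:)

gcloud container clusters update CLUSTER_NAME \
    --zone europe-west4-b \
    --node-pool NODE_POOL_NAME \
    --enable-autoscaling --min-nodes 17 --max-nodes 25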
When I create a deployment, it is stuck in the Pending state with the message:
pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max cluster cpu, memory limit reached
Max cluster cpu, memory limit reached sounds like the maximum node count is somehow still 14-15. How is that possible? Why is it not triggering a node scale-up?
ClusterAutoscalerStatus:
apiVersion: v1
data:
  status: |+
    Cluster-autoscaler status at 2020-03-10 10:35:39.899329642 +0000 UTC:
    Cluster-wide:
      Health:      Healthy (ready=14 unready=0 notStarted=0 longNotStarted=0 registered=14 longUnregistered=0)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:49:11.965623459 +0000 UTC m=+4133.007827509
      ScaleUp:     NoActivity (ready=14 registered=14)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 08:40:47.775200087 +0000 UTC m=+28.817404126
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:49:49.580623718 +0000 UTC m=+4170.622827779

    NodeGroups:
      Name:        https://content.googleapis.com/compute/v1/projects/project/zones/europe-west4-b/instanceGroups/adjust-scope-bff43e09-grp
      Health:      Healthy (ready=14 unready=0 notStarted=0 longNotStarted=0 registered=14 longUnregistered=0 cloudProviderTarget=14 (minSize=17, maxSize=25))
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:46:19.45614781 +0000 UTC m=+3960.498351857
      ScaleUp:     NoActivity (ready=14 cloudProviderTarget=14)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:46:19.45614781 +0000 UTC m=+3960.498351857
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:49:49.580623718 +0000 UTC m=+4170.622827779
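(The status above comes from the cluster-autoscaler-status ConfigMap that the autoscaler publishes; assuming a standard GKE cluster, something like this dumps it:)

kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml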
The deployment is very small (200m CPU, 256Mi memory), so it would surely fit if a new node were added.
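(For anyone trying to reproduce this, a minimal workload with the same requests should hit the same message; a sketch, using a hypothetical name and the pause image:)

cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up-test            # hypothetical name, just for testing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scale-up-test
  template:
    metadata:
      labels:
        app: scale-up-test
    spec:
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.2
        resources:
          requests:
            cpu: 200m            # same requests as the real deployment
            memory: 256Mi
EOF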
This looks like a bug in the node pool/autoscaler: 15 was my former node count limit, and somehow it seems to still think 15 is the cap.
EDIT: New node pool with bigger machines, autoscaling in GKE turned on, and still the same issue after some time, even though the nodes have free resources. Top output from the nodes:
NAME                                                  CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gke-infrastructure-n-autoscaled-node--0816b9c6-fm5v   805m         41%    4966Mi          88%
gke-infrastructure-n-autoscaled-node--0816b9c6-h98f   407m         21%    2746Mi          48%
gke-infrastructure-n-autoscaled-node--0816b9c6-hr0l   721m         37%    3832Mi          67%
gke-infrastructure-n-autoscaled-node--0816b9c6-prfw   1020m        52%    5102Mi          90%
gke-infrastructure-n-autoscaled-node--0816b9c6-s94x   946m         49%    3637Mi          64%
gke-infrastructure-n-autoscaled-node--0816b9c6-sz5l   2000m        103%   5738Mi          101%
gke-infrastructure-n-autoscaled-node--0816b9c6-z6dv   664m         34%    4271Mi          75%
gke-infrastructure-n-autoscaled-node--0816b9c6-zvbr   970m         50%    3061Mi          54%
And yet I still get the message 1 max cluster cpu, memory limit reached. This keeps happening when updating a deployment: the new version sometimes gets stuck in Pending because it won't trigger the scale-up.
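(The same text shows up as a NotTriggerScaleUp event on the pending pod, so it can be confirmed with something like:)

kubectl describe pod PENDING_POD_NAME
# look for an event along the lines of:
#   NotTriggerScaleUp  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max cluster cpu, memory limit reached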
EDIT 2: While describing the cluster with the gcloud command, I found this:
autoscaling:
  autoprovisioningNodePoolDefaults:
    oauthScopes:
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    serviceAccount: default
  enableNodeAutoprovisioning: true
  resourceLimits:
  - maximum: '5'
    minimum: '1'
    resourceType: cpu
  - maximum: '5'
    minimum: '1'
    resourceType: memory
How does this work with autoscaling turned on? Will it refuse to trigger a scale-up once those limits are reached? (The cluster's current totals are already above them.)
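(For anyone checking their own cluster, that block can be pulled with a describe call roughly like this; CLUSTER_NAME is a placeholder and the zone is the one from the node group above:)

gcloud container clusters describe CLUSTER_NAME \
    --zone europe-west4-b \
    --format="yaml(autoscaling)"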
Cluster autoscaler scales down only the nodes that can be safely removed. If scaling up is disabled, the node pool does not scale above the value you specified. Note that cluster autoscaler never automatically scales to zero nodes: one or more nodes must always be available in the cluster to run system Pods.
In the Google Cloud console, go to the Kubernetes Clusters page. Select the name of your cluster to view its Cluster Details page. On the Cluster Details page, click on the Logs tab. On the Logs tab, click on the Autoscaler Logs tab to view the logs.
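(Outside the console, the same autoscaler decisions can be pulled with gcloud; the filter below assumes the GKE autoscaler visibility log name from the docs, with PROJECT_ID as a placeholder:)

gcloud logging read \
    'resource.type="k8s_cluster" AND logName="projects/PROJECT_ID/logs/container.googleapis.com%2Fcluster-autoscaler-visibility"' \
    --limit 20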
The Kubernetes Cluster Autoscaler automatically adjusts the number of nodes in your cluster when pods fail or are rescheduled onto other nodes. The Cluster Autoscaler is typically installed as a Deployment in your cluster.
When enabled, the cluster autoscaler algorithm checks for pending pods. The cluster autoscaler requests a newly provisioned node if: 1) there are pending pods due to not having enough available cluster resources to meet their requests and 2) the cluster or node pool has not reached the user-defined maximum node count.
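(A quick way to check the first condition is to list the pods that are still unscheduled:)

kubectl get pods --all-namespaces --field-selector=status.phase=Pending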
I ran into the same issue and was bashing my head against the wall trying to figure out what was going on. Even support couldn't figure it out.
The issue is that if you enable node auto-provisioning at the cluster level, you are setting the actual min/max CPU and memory allowed for the entire cluster. At first glance the UI seems to be asking for the min/max CPU and memory you would want per auto-provisioned node, but that is not correct. So if, for example, you wanted a maximum of 100 nodes with 8 CPUs per node, then your max CPU should be 800. A cluster-wide maximum is obviously useful so things don't get out of control, but the way it is presented is not intuitive. Since you don't actually have control over which machine type gets picked, wouldn't it be useful to stop Kubernetes from picking a 100-core machine for a 1-core task? That is what I thought it was asking when I was configuring it.
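(So for that hypothetical 100-node / 8-CPU-per-node setup, the cluster-wide limits would be set roughly like this; the memory figure just assumes an illustrative 32 GB per node:)

gcloud container clusters update CLUSTER_NAME \
    --enable-autoprovisioning \
    --min-cpu 1 --max-cpu 800 \
    --min-memory 1 --max-memory 3200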
Node auto-provisioning is still useful, because if the autoscaler on your own node pool can't meet your demands for some reason (for example a quota issue), the cluster-level node auto-provisioner will figure out a different node pool machine type that it can provision to meet them. In my scenario I was using C2 CPUs, and there was a scarcity of those CPUs in the region, so my node pool stopped auto-scaling.
To make things even more confusing, most people start by specifying their node pool machine type, so they are already used to customizing these limits on a per-node basis. But then something stops working, like a quota issue you have no idea about, so you get desperate and configure the node auto-provisioner at the cluster level, and then get burned because you thought you were specifying the limits for the new potential machine type.
Hopefully this helps clear some things up.