
GKE autoscaling doesn't scale down

We use GKE (Google Kubernetes Engine) to run Airflow via Google Cloud Composer for our data pipeline.

We started out with 6 nodes and realised that costs spiked even though we weren't using that much CPU. So we thought we could lower the maximum node count and also enable autoscaling.

Since we run the pipeline during the night and only smaller jobs during the day, we wanted autoscaling between 1 and 3 nodes.

So we enabled autoscaling on the GKE node pool, but not on the GCE instance group, as recommended. However, we get this message: Node pool does not scale.
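For reference, this is roughly the command we used to enable autoscaling on the node pool itself (rather than the instance group); the cluster, zone, and pool names below are placeholders for our actual setup:

```shell
# Enable the GKE cluster autoscaler on an existing node pool.
# Cluster, zone, and pool names are placeholders.
gcloud container clusters update my-composer-cluster \
  --zone europe-west1-b \
  --node-pool default-pool \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 3
```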

Why is this?

Below is a graph of our CPU utilisation over the past 4 days: [CPU utilisation graph]

We never pass 20% usage, so why doesn't it scale down?

This morning we manually scaled it down to 3 nodes.

asked Jan 26 '23 by Andreas Rolén

2 Answers

The first thing I want to mention is that the scale-down process is triggered when there are underutilized nodes in the cluster. In this context, "underutilized" does not refer to actual CPU usage, so your reasoning is not completely right.

As the documentation says, the condition is that the sum of the CPU and memory requests of all Pods running on a node is smaller than the utilization threshold defined for the autoscaler. Then, "if a node is unneeded for more than 10 minutes, it will be terminated". See this for more information.
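To make the condition concrete, here is a minimal Python sketch of the check. The 0.5 threshold matches the cluster autoscaler's documented default scale-down utilization threshold; the node and Pod numbers are made up:

```python
# Sketch of the autoscaler's scale-down condition: a node is a
# scale-down candidate when the sum of its Pods' *requests* (not
# actual usage) stays below the utilization threshold for both
# CPU and memory. The default threshold is 0.5.

def is_scale_down_candidate(pod_requests, allocatable, threshold=0.5):
    """pod_requests: list of (cpu_millicores, memory_mib) per Pod.
    allocatable: (cpu_millicores, memory_mib) of the node."""
    cpu_sum = sum(cpu for cpu, _ in pod_requests)
    mem_sum = sum(mem for _, mem in pod_requests)
    cpu_util = cpu_sum / allocatable[0]
    mem_util = mem_sum / allocatable[1]
    return cpu_util < threshold and mem_util < threshold

# Hypothetical node: 2000m CPU / 7500 MiB allocatable, three Pods.
pods = [(200, 512), (100, 256), (250, 1024)]
print(is_scale_down_candidate(pods, (2000, 7500)))  # True: both sums are under 50%
```

Note that a node with low real CPU usage but high *requests* (as in your graph) will never pass this check, which is why low usage alone does not trigger a scale-down.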

It is also important to know that some other factors can prevent the scale-down process, for instance the node auto-provisioning limits. Check this for more info about Pods that can prevent the cluster autoscaler from removing a node.

answered Feb 11 '23 by Alex6Zam


Cloud Composer does not yet (as of 2019/08/26) support the GKE cluster autoscaler, because the cluster autoscaler makes scaling decisions based on the resource requests of Pods, as well as how many Pods are in the unschedulable state (more information here). Composer deploys a fixed number of Pods, which means the autoscaling mechanism doesn't force any scaling action unless you yourself are deploying your own workloads into the cluster.

Autoscaling is also difficult because the actual resource usage of an Airflow worker or scheduler depends on how many DAGs you upload (into GCS, in Composer's case), so there is no accurate estimate of how much CPU/memory your Airflow processes will use. That makes it hard to decide on resource requests for the Airflow Pods.


In the absence of autoscaling, there are still many options for dynamic resource allocation. For example, you can use KubernetesPodOperator to deploy Pods with resource requests into a different Kubernetes cluster that does have autoscaling enabled. Alternatively, you can use the GCE operators to add instances to your cluster before launching more resource-heavy workloads.
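As a sketch of the KubernetesPodOperator route: the Pod it launches into the other cluster would carry explicit resource requests along these lines (names, image, and values are illustrative), which is what gives that cluster's autoscaler something to act on:

```yaml
# Illustrative Pod spec; the target cluster's autoscaler makes its
# scale-up/scale-down decisions based on these requests.
apiVersion: v1
kind: Pod
metadata:
  name: heavy-etl-task          # hypothetical task name
spec:
  restartPolicy: Never
  containers:
    - name: etl
      image: gcr.io/my-project/etl-job:latest   # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "2"
          memory: 4Gi
```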

answered Feb 11 '23 by hexacyanide