I'm using AWS EKS with t3.medium instances, so each node has 2 vCPUs (2000m of CPU) and 4 GB of RAM.
I'm running 6 different apps on the nodes with these CPU request definitions:
name    request   replicas   total CPU
app#1   300m      x2         600m
app#2   100m      x4         400m
app#3   150m      x1         150m
app#4   300m      x1         300m
app#5   100m      x1         100m
app#6   150m      x1         150m
With basic math, the apps together request 1700m of CPU. I also have an HPA with a 60% CPU target for app#1 and app#2. So I expected to have just one node, or maybe two (because of the kube-system pods), but the cluster is always running with 3 nodes. It looks like I've misunderstood how autoscaling works.
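For reference, an equivalent imperative setup for those two HPAs would look something like this (deployment names taken from the kubectl get hpa output further down; this is just a sketch, the original manifests aren't shown):

kubectl autoscale deployment easyinventory-deployment --cpu-percent=60 --min=1 --max=5
kubectl autoscale deployment poolinventory-deployment --cpu-percent=60 --min=1 --max=5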
$ kubectl top nodes
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-*.eu-central-1.compute.internal   221m         11%    631Mi           18%
ip-*.eu-central-1.compute.internal   197m         10%    718Mi           21%
ip-*.eu-central-1.compute.internal   307m         15%    801Mi           23%
As you can see, the nodes are only using 10-15% of their CPU. How can I optimize node scaling? Why is the cluster keeping 3 nodes?
$ kubectl get hpa
NAME    REFERENCE                             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
app#1   Deployment/easyinventory-deployment   37%/60%   1         5         3          5d16h
app#2   Deployment/poolinventory-deployment   64%/60%   1         5         4          4d10h
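One thing worth noting: the 1700m estimate assumes the replica counts in the table above, but with the replicas the HPA is currently reporting, the requested CPU is already higher:

app#1: 3 x 300m = 900m
app#2: 4 x 100m = 400m
app#3-#6:         700m
total:           2000m

So the app requests alone already exceed what a single t3.medium can offer (allocatable CPU is a bit under 2000m once system reservations are subtracted), even before the kube-system pods' own requests are counted.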
UPDATE #1
I have PodDisruptionBudgets for the kube-system pods:
kubectl create poddisruptionbudget pdb-event --namespace=kube-system --selector k8s-app=event-exporter --max-unavailable 1
kubectl create poddisruptionbudget pdb-fluentd --namespace=kube-system --selector k8s-app=fluentd-gcp-scaler --max-unavailable 1
kubectl create poddisruptionbudget pdb-heapster --namespace=kube-system --selector k8s-app=heapster --max-unavailable 1
kubectl create poddisruptionbudget pdb-dns --namespace=kube-system --selector k8s-app=kube-dns --max-unavailable 1
kubectl create poddisruptionbudget pdb-dnsauto --namespace=kube-system --selector k8s-app=kube-dns-autoscaler --max-unavailable 1
kubectl create poddisruptionbudget pdb-glbc --namespace=kube-system --selector k8s-app=glbc --max-unavailable 1
kubectl create poddisruptionbudget pdb-metadata --namespace=kube-system --selector app=metadata-agent-cluster-level --max-unavailable 1
kubectl create poddisruptionbudget pdb-kubeproxy --namespace=kube-system --selector component=kube-proxy --max-unavailable 1
kubectl create poddisruptionbudget pdb-metrics --namespace=kube-system --selector k8s-app=metrics-server --max-unavailable 1
#source: https://gist.github.com/kenthua/fc06c6ea52a25a51bc07e70c8f781f8f
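To double-check that these PDBs exist (and to see their allowed disruptions when debugging a blocked drain), listing them is enough; output will vary per cluster:

kubectl get pdb -n kube-system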
UPDATE #2
I figured out the 3rd node is not always live; Kubernetes scales down to 2 nodes, but after a few minutes scales back up to 3, then down to 2 again, over and over.
$ kubectl describe nodes
# Node 1
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1010m (52%)   1300m (67%)
  memory                      3040Mi (90%)  3940Mi (117%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
# Node 2
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1060m (54%)   1850m (95%)
  memory                      3300Mi (98%)  4200Mi (125%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
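Those 90%/98% values are sums of pod memory requests against allocatable memory, not live usage. To see which pods' requests add up to that total on a node, the Non-terminated Pods section of the same describe output can be sliced out, e.g. (node name masked the same way as above):

kubectl describe node ip-*-93.eu-central-1.compute.internal | sed -n '/Non-terminated Pods/,/Allocated resources/p'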
UPDATE #3
I0608 11:03:21.965642 1 static_autoscaler.go:192] Starting main loop
I0608 11:03:21.965976 1 utils.go:590] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0608 11:03:21.965996 1 filter_out_schedulable.go:65] Filtering out schedulables
I0608 11:03:21.966120 1 filter_out_schedulable.go:130] 0 other pods marked as unschedulable can be scheduled.
I0608 11:03:21.966164 1 filter_out_schedulable.go:130] 0 other pods marked as unschedulable can be scheduled.
I0608 11:03:21.966175 1 filter_out_schedulable.go:90] No schedulable pods
I0608 11:03:21.966202 1 static_autoscaler.go:334] No unschedulable pods
I0608 11:03:21.966257 1 static_autoscaler.go:381] Calculating unneeded nodes
I0608 11:03:21.966336 1 scale_down.go:437] Scale-down calculation: ignoring 1 nodes unremovable in the last 5m0s
I0608 11:03:21.966359 1 scale_down.go:468] Node ip-*-93.eu-central-1.compute.internal - memory utilization 0.909449
I0608 11:03:21.966411 1 scale_down.go:472] Node ip-*-93.eu-central-1.compute.internal is not suitable for removal - memory utilization too big (0.909449)
I0608 11:03:21.966460 1 scale_down.go:468] Node ip-*-115.eu-central-1.compute.internal - memory utilization 0.987231
I0608 11:03:21.966469 1 scale_down.go:472] Node ip-*-115.eu-central-1.compute.internal is not suitable for removal - memory utilization too big (0.987231)
I0608 11:03:21.966551 1 static_autoscaler.go:440] Scale down status: unneededOnly=false lastScaleUpTime=2020-06-08 09:14:54.619088707 +0000 UTC m=+143849.361988520 lastScaleDownDeleteTime=2020-06-06 17:18:02.104469988 +0000 UTC m=+36.847369765 lastScaleDownFailTime=2020-06-06 17:18:02.104470075 +0000 UTC m=+36.847369849 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I0608 11:03:21.966578 1 static_autoscaler.go:453] Starting scale down
I0608 11:03:21.966667 1 scale_down.go:785] No candidates for scale down
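The "memory utilization too big" lines come from the Cluster Autoscaler's scale-down check: a node is only a removal candidate when the sum of its pods' requests divided by allocatable stays below --scale-down-utilization-threshold (default 0.5), and 0.90/0.98 is far above that. As a minimal sketch (not a recommendation, just to show where these knobs live), the relevant flags sit in the cluster-autoscaler deployment's container args:

containers:
- name: cluster-autoscaler
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --scale-down-utilization-threshold=0.5   # nodes above this requests/allocatable ratio are never removed
  - --scale-down-unneeded-time=10m           # how long a node must stay below the threshold before removal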
UPDATE #4
According to the autoscaler logs, it was ignoring ip-*-145.eu-central-1.compute.internal for scale-down for some reason. I wondered what would happen, so I terminated the instance directly from the EC2 console, and these lines appeared in the autoscaler logs:
I0608 11:10:43.747445 1 scale_down.go:517] Finding additional 1 candidates for scale down.
I0608 11:10:43.747477 1 cluster.go:93] Fast evaluation: ip-*-145.eu-central-1.compute.internal for removal
I0608 11:10:43.747540 1 cluster.go:248] Evaluation ip-*-115.eu-central-1.compute.internal for default/app2-848db65964-9nr2m -> PodFitsResources predicate mismatch, reason: Insufficient memory,
I0608 11:10:43.747549 1 cluster.go:248] Evaluation ip-*-93.eu-central-1.compute.internal for default/app2-848db65964-9nr2m -> PodFitsResources predicate mismatch, reason: Insufficient memory,
I0608 11:10:43.747557 1 cluster.go:129] Fast evaluation: node ip-*-145.eu-central-1.compute.internal is not suitable for removal: failed to find place for default/app2-848db65964-9nr2m
I0608 11:10:43.747569 1 scale_down.go:554] 1 nodes found to be unremovable in simulation, will re-check them at 2020-06-08 11:15:43.746773707 +0000 UTC m=+151098.489673532
I0608 11:10:43.747596 1 static_autoscaler.go:440] Scale down status: unneededOnly=false lastScaleUpTime=2020-06-08 09:14:54.619088707 +0000 UTC m=+143849.361988520 lastScaleDownDeleteTime=2020-06-06 17:18:02.104469988 +0000 UTC m=+36.847369765 lastScaleDownFailTime=2020-06-06 17:18:02.104470075 +0000 UTC m=+36.847369849 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
As far as I can see, the node is not scaling down because there is no other node that "app2" would fit on. But app2's memory request is 700Mi, and right now the other nodes seem to have enough memory for it:
$ kubectl top nodes
NAME                                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-0-93.eu-central-1.compute.internal    386m         20%    920Mi           27%
ip-10-0-1-115.eu-central-1.compute.internal   298m         15%    794Mi           23%
I still have no idea why the autoscaler won't move app2 to one of the other available nodes and scale down ip-*-145.
In Kubernetes, a HorizontalPodAutoscaler automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload to match demand. Horizontal scaling means that the response to increased load is to deploy more Pods.
Scaling a Kubernetes cluster is updating the cluster by adding nodes to it or removing nodes from it. When you add nodes to a Kubernetes cluster, you are scaling up the cluster, and when you remove nodes from the cluster, you are scaling down the cluster.
Autoscaling is one of the key features of a Kubernetes cluster: the cluster increases the number of nodes as demand for the services grows and decreases the number of nodes as demand drops.
The Cluster Autoscaler loads the state of the entire cluster into memory, including the pods, nodes, and node groups. On each scan interval, the algorithm identifies unschedulable pods and simulates scheduling for each node group; for scale-down it likewise simulates whether the pods on a lightly utilized node could be rescheduled onto the remaining nodes. Tuning parameters such as the scan interval and the scale-down thresholds comes with different tradeoffs.
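Those scheduling simulations are exactly what shows up in the logs quoted in UPDATE #3 and #4; they can be followed live with something like this (assuming the autoscaler runs as the usual cluster-autoscaler deployment in kube-system):

kubectl -n kube-system logs -f deployment/cluster-autoscaler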
Here is how Pods with resource requests are scheduled: a request is the amount reserved and guaranteed for the container, so the scheduler will not place a pod onto a node that does not have enough unrequested capacity left. In your case, the two existing nodes have already committed almost all of their allocatable memory to requests (roughly 0.91 and 0.99 in the autoscaler logs), so ip-*-145 cannot be scaled down; otherwise app2 would have nowhere to go.
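A quick sanity check with the numbers from the question (allocatable memory is back-derived from the percentages in kubectl describe nodes, so these are approximate):

Node 1: 3040Mi requested at 90%  ->  allocatable ≈ 3370Mi  ->  ≈ 330Mi of requests still fit
Node 2: 3300Mi requested at 98%  ->  allocatable ≈ 3370Mi  ->  ≈ 70Mi of requests still fit

Neither node can take app2's 700Mi memory request, even though kubectl top shows plenty of unused memory, because both scheduling and the autoscaler's scale-down simulation work on requests, not on live usage. Lowering the memory requests to match real consumption (or using nodes with more memory) is what would let the cluster settle at two nodes.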