I have an issue with my GKE cluster. I am using two node pools: secondary, with a standard set of n1-highmem nodes, and primary, with preemptible n1-highmem nodes. The issue is that I have many pods in Error/Completed status which are not cleared by Kubernetes, all of which ran on the preemptible pool. THESE PODS ARE NOT JOBS.
GKE documentation says that: "Preemptible VMs are Compute Engine VM instances that are priced lower than standard VMs and provide no guarantee of availability. Preemptible VMs offer similar functionality to Spot VMs, but only last up to 24 hours after creation."
"When Compute Engine needs to reclaim the resources used by preemptible VMs, a preemption notice is sent to GKE. Preemptible VMs terminate 30 seconds after receiving a termination notice." Ref: https://cloud.google.com/kubernetes-engine/docs/how-to/preemptible-vms
And from the kubernetes documentation: "For failed Pods, the API objects remain in the cluster's API until a human or controller process explicitly removes them.
The Pod garbage collector (PodGC), which is a controller in the control plane, cleans up terminated Pods (with a phase of Succeeded or Failed), when the number of Pods exceeds the configured threshold (determined by terminated-pod-gc-threshold in the kube-controller-manager). This avoids a resource leak as Pods are created and terminated over time." Ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection
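For reference, the upstream default for terminated-pod-gc-threshold is 12500, and as far as I can tell GKE's managed control plane does not let you change that flag. A quick way to see how far the cluster is from the threshold is to count terminated pods (this only counts the Failed phase; Succeeded pods would need a second query with status.phase=Succeeded):

# Count Failed pods cluster-wide; PodGC only starts deleting terminated pods
# once their total number exceeds the controller's terminated-pod-gc-threshold.
kubectl get pods --all-namespaces --field-selector=status.phase=Failed --no-headers | wc -l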
So, as I understand it, this set of nodes is recycled every 24 hours, which kills all the pods running on them, and depending on how graceful shutdown goes, the pods end up in Completed or Error state. Nevertheless, Kubernetes is not clearing or removing them, so I have tons of pods in these statuses in my cluster, which is not expected at all.
I am attaching screenshots for reference.
Example kubectl describe pod output:
Status: Failed
Reason: Terminated
Message: Pod was terminated in response to imminent node shutdown.
Apart from that, no events, logs, etc.
GKE version: 1.24.7-gke.900
Both Node pools versions: 1.24.5-gke.600
Has anyone encountered such an issue or knows what's going on there? Is there a way to clear them other than creating some script and running it periodically?
I tried digging into the GKE logs, but I couldn't find anything. I also tried to look for answers in the docs, but failed.
The given commands do not work for me.
I have created a few manifests that you can apply in your cluster to automatically delete the Pods matching the criteria with a Kubernetes CronJob.
https://github.com/tyriis/i-see-dead-pods
This is working for me:
kubectl get pods \
--all-namespaces \
-o go-template \
--template='{{range .items}}{{printf "%s %s %s\n" .metadata.namespace .metadata.name .status.message}}{{end}}' \
| grep "Pod was terminated in response to imminent node shutdown." \
| awk '{print $1, $2}' \
| xargs -r -n2 kubectl delete pod -n
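If you want to run this on a schedule instead of by hand, here is a rough sketch of the same idea as a CronJob created from the command line. This is not the exact manifest from the repository above; the image, the schedule, and the use of a plain field selector are my own assumptions, and note that a field selector can only match status.phase=Failed, not the specific node-shutdown message, so it deletes all Failed pods. The CronJob's service account also needs RBAC permissions to list and delete pods in all namespaces.

# Minimal sketch: delete all Failed pods every 4 hours using a kubectl image.
kubectl create cronjob failed-pod-cleanup \
  --image=bitnami/kubectl:latest \
  --schedule="0 */4 * * *" \
  -- kubectl delete pods --all-namespaces --field-selector=status.phase=Failed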