Kubernetes pods disappear after failed jobs

I am running Kubernetes jobs via cron. Some of these jobs can fail, and when they do I want them to restart. I'm scheduling the jobs like this:

kubectl run collector-60053 --schedule="30 10 * * *" --image=gcr.io/myimage/collector --restart=OnFailure --command -- node collector.js

I'm having a problem where some of these jobs are running and failing but the associated pods are disappearing, so I have no way to look at the logs and they are not restarting.

For example:

$ kubectl get jobs | grep 60053
collector-60053-1546943400     1         0            1h
$ kubectl get pods -a | grep 60053
$    # nothing returned

This is on Google Cloud Platform (GKE), running Kubernetes 1.10.9-gke.5.

Any help would be much appreciated!

EDIT:

I discovered some more information. I have auto-scaling set up on my GCP cluster. I noticed that when nodes are removed, the pods are also removed (along with their metadata). Is that expected behavior? Unfortunately it leaves me no easy way to look at the pod logs.

My theory is that as the pods fail, CrashLoopBackOff kicks in and eventually the autoscaler decides the node is no longer needed (it doesn't see the pod as an active workload). At that point the node goes away, and so do the pods. I don't think this is expected behavior with --restart=OnFailure, but I basically witnessed it by watching closely.
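
In case it helps anyone else debugging the same thing: events outlive the pods, so the job's history and any autoscaler scale-down activity can still be inspected after the pods are gone. A rough sketch (the job name is the one from my example above, and the grep pattern is just a loose filter for autoscaler / node-removal event reasons):

# Events recorded against the job (pod creation, failures) survive even
# after the pods themselves are removed along with the node.
kubectl describe job collector-60053-1546943400

# Look for cluster-autoscaler scale-down / node-removal events around the
# time the pods vanished.
kubectl get events --all-namespaces --sort-by=.lastTimestamp | grep -iE 'scaledown|removingnode|deletingnode'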

asked Jan 08 '19 by user1527312

1 Answer

After digging much further into this issue, I have a better understanding of my situation. According to issue 54870 on the Kubernetes repository, there are known problems with Jobs when the pod restartPolicy is set to OnFailure.

I have changed my configuration to use restartPolicy: Never and to set a backoffLimit on the Job. Even though the restart policy is Never, in my testing Kubernetes still retries by creating new pods, up to the backoffLimit, and it keeps the failed pods around so their logs can be inspected.
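
For reference, here is a minimal sketch of what this looks like as a CronJob manifest rather than kubectl run. It is only an illustration: the apiVersion batch/v1beta1 matches a 1.10 cluster, the backoffLimit value of 4 is an arbitrary choice, and the name, image and schedule are copied from the question.

cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: collector-60053
spec:
  schedule: "30 10 * * *"
  jobTemplate:
    spec:
      backoffLimit: 4            # retries happen by creating new pods, up to this limit
      template:
        spec:
          restartPolicy: Never   # failed pods are kept, so their logs stay available
          containers:
          - name: collector
            image: gcr.io/myimage/collector
            command: ["node", "collector.js"]
EOF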

answered by user1527312