I am running Kubernetes jobs via cron. In some cases the jobs may fail and I want them to restart. I'm scheduling the jobs like this:
kubectl run collector-60053 --schedule="30 10 * * *" --image=gcr.io/myimage/collector --restart=OnFailure --command -- node collector.js
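For reference, this is roughly the CronJob object that command creates on a 1.10.x cluster (a hand-written sketch using the batch/v1beta1 API, not actual kubectl output; the field values are copied from the command above):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: collector-60053
spec:
  schedule: "30 10 * * *"            # five-field cron expression
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure   # failed containers are restarted in place, in the same pod
          containers:
          - name: collector-60053
            image: gcr.io/myimage/collector
            command: ["node", "collector.js"]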
The problem is that some of these jobs run and fail, but the associated pods disappear, so I have no way to look at the logs, and the jobs are not restarting.
For example:
$ kubectl get jobs | grep 60053
collector-60053-1546943400 1 0 1h
$ kubectl get pods -a | grep 60053
$ # nothing returned
This is on Google Cloud Platform running Kubernetes 1.10.9-gke.5.
Any help would be much appreciated!
EDIT:
I discovered some more information. I have auto-scaling set up on my GCP cluster, and I noticed that when nodes are removed, the pods (and their metadata) are removed as well. Is that expected behavior? Unfortunately it leaves me no easy way to look at the pod logs.
My theory is that as pods fail, CrashLoopBackOff kicks in, and eventually auto-scaling decides the node is no longer needed (it doesn't see the backing-off pod as an active workload). At that point the node goes away, and so do the pods. I don't think this is expected behavior with restart=OnFailure, but I basically witnessed it by watching closely.
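To check whether a scale-down is really what removed the pods, the cluster events can be inspected before they expire. This is only a general sketch: the grep pattern is a guess at the relevant event reasons, and on GKE the managed autoscaler's status configmap may not be readable.

# Look for autoscaler / node-removal events across the cluster
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | grep -iE 'scaledown|removingnode|deletingnode'

# If the cluster-autoscaler status configmap is exposed, it summarizes recent scale-down decisions
kubectl -n kube-system describe configmap cluster-autoscaler-status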
From the Kubernetes documentation on Pod lifetime: "If a Node dies, the Pods scheduled to that node are scheduled for deletion after a timeout period. Pods do not, by themselves, self-heal. If a Pod is scheduled to a node that then fails, the Pod is deleted; likewise, a Pod won't survive an eviction due to a lack of resources or Node maintenance."
After digging much further into this issue, I have an understanding of my situation. According to issue 54870 on the Kubernetes repository, there are problems with Jobs when restartPolicy is set to OnFailure.
I have changed my configuration to use restartPolicy: Never and to set a backoffLimit on the Job. Even with restart set to Never, in my testing Kubernetes still re-runs the job (creating a new pod for each attempt) up to the backoffLimit, and it keeps the failed pods around for inspection.
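For reference, the relevant part of the updated spec looks roughly like this (a hand-written sketch; the backoffLimit of 4 is only an example value):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: collector-60053
spec:
  schedule: "30 10 * * *"
  jobTemplate:
    spec:
      backoffLimit: 4              # number of retries before the Job is marked failed (example value)
      template:
        spec:
          restartPolicy: Never     # each retry runs in a fresh pod, so failed pods remain for inspection
          containers:
          - name: collector-60053
            image: gcr.io/myimage/collector
            command: ["node", "collector.js"]

With the failed pods kept around, their logs can still be read with kubectl logs <pod-name>.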