
Understanding backoffLimit in Kubernetes Job


I've created a CronJob in Kubernetes with schedule `8 * * * *`, with the Job's backoffLimit left at its default of 6 and the pod's restartPolicy set to Never. The pods are deliberately configured to fail. As I understand it (for a podSpec with restartPolicy: Never), the Job controller will create up to backoffLimit pods and then mark the Job as Failed, so I expected to see 6 pods in the Error state.

This is the actual Job’s status:

```yaml
status:
  conditions:
  - lastProbeTime: 2019-02-20T05:11:58Z
    lastTransitionTime: 2019-02-20T05:11:58Z
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 5
```

Why were there only 5 failed pods instead of 6? Or is my understanding of backoffLimit incorrect?

asked Feb 22 '19 by goutham


1 Answer

In short: you might not be seeing all the created pods because the schedule period of the CronJob is too short.

As described in the documentation:

Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s …) capped at six minutes. The back-off count is reset if no new failed Pods appear before the Job’s next status check.
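The delay schedule from the quoted documentation can be sketched numerically. This is a minimal illustration of the documented 10s/20s/40s... progression capped at six minutes, not Kubernetes source code:

```python
# Exponential back-off delays between Job pod recreations, as described in
# the Kubernetes docs: 10s, 20s, 40s, ... capped at six minutes (360s).
CAP_SECONDS = 6 * 60

def backoff_delay(retry: int) -> int:
    """Delay (in seconds) applied before retry number `retry` (1-based)."""
    return min(10 * 2 ** (retry - 1), CAP_SECONDS)

delays = [backoff_delay(n) for n in range(1, 7)]
print(delays)  # [10, 20, 40, 80, 160, 320]
```

Note that even for backoffLimit: 6 the cap is never reached; it only matters for higher limits.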

If a new Job is scheduled before the Job controller has had a chance to recreate a pod (keeping in mind the delay after the previous failure), the Job controller starts counting from one again.
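To see why a short schedule can cut the retry sequence off, compare the cumulative back-off time against the cron period. The arithmetic below just sums the documented delays; the 180s period corresponds to a hypothetical `*/3 * * * *` schedule:

```python
# Waits (seconds) before retries 2..6, per the documented back-off schedule.
delays = [10, 20, 40, 80, 160]
SCHEDULE_PERIOD = 180  # a */3 cron schedule fires every 180s

cumulative = 0
for retry, d in enumerate(delays, start=2):
    cumulative += d
    late = " (after the next scheduled run!)" if cumulative > SCHEDULE_PERIOD else ""
    print(f"retry {retry} starts ~{cumulative}s after the first failure{late}")
```

The later retries land after the next CronJob run has already been created, which is when the back-off count can reset and pods go "missing" from the tally.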

I reproduced your issue in GKE using the following .yaml:

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hellocron
spec:
  schedule: "*/3 * * * *" # Runs every 3 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hellocron
            image: busybox
            args:
            - /bin/cat
            - /etc/os
          restartPolicy: Never
      backoffLimit: 6
  suspend: false
```

This job will fail because the file /etc/os doesn't exist.
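The failure is easy to check locally, since the container just runs cat on a missing file. This assumes /etc/os does not exist on your machine, which is the case on standard Linux distributions (the real file is /etc/os-release):

```shell
# Same command the container runs; cat exits non-zero for a missing file,
# which makes the pod terminate as Failed under restartPolicy: Never.
/bin/cat /etc/os
echo "exit code: $?"
```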

And here is the output of kubectl describe for one of the Jobs:

```
Name:           hellocron-1551194280
Namespace:      default
Selector:       controller-uid=b81cdfb8-39d9-11e9-9eb7-42010a9c00d0
Labels:         controller-uid=b81cdfb8-39d9-11e9-9eb7-42010a9c00d0
                job-name=hellocron-1551194280
Annotations:    <none>
Controlled By:  CronJob/hellocron
Parallelism:    1
Completions:    1
Start Time:     Tue, 26 Feb 2019 16:18:07 +0100
Pods Statuses:  0 Running / 0 Succeeded / 6 Failed
Pod Template:
  Labels:  controller-uid=b81cdfb8-39d9-11e9-9eb7-42010a9c00d0
           job-name=hellocron-1551194280
  Containers:
   hellocron:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Args:
      /bin/cat
      /etc/os
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type     Reason                Age   From            Message
  ----     ------                ----  ----            -------
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-4lf6h
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-85khk
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-wrktb
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-6942s
  Normal   SuccessfulCreate      25m   job-controller  Created pod: hellocron-1551194280-662zv
  Normal   SuccessfulCreate      22m   job-controller  Created pod: hellocron-1551194280-6c6rh
  Warning  BackoffLimitExceeded  17m   job-controller  Job has reached the specified backoff limit
```

Note the delay between the creation of pods hellocron-1551194280-662zv and hellocron-1551194280-6c6rh.

answered Sep 17 '22 by MWZ