
Understanding backoffLimit in Kubernetes Job


I've created a CronJob in Kubernetes with schedule `8 * * * *`, with the Job's backoffLimit left at its default of 6 and the pod's restartPolicy set to Never. The pods are deliberately configured to fail. As I understand it (for a podSpec with restartPolicy: Never), the Job controller will create up to backoffLimit pods and then mark the Job as Failed, so I expected to see 6 pods in the Error state.

This is the actual Job’s status:

```yaml
status:
  conditions:
  - lastProbeTime: 2019-02-20T05:11:58Z
    lastTransitionTime: 2019-02-20T05:11:58Z
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 5
```

Why were there only 5 failed pods instead of 6? Or is my understanding of backoffLimit incorrect?

asked Feb 22 '19 by goutham


1 Answer

In short: you might not be seeing all the created pods because the schedule period of the CronJob is too short.

As described in the documentation:

Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s …) capped at six minutes. The back-off count is reset if no new failed Pods appear before the Job’s next status check.
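The delay schedule from the quoted documentation can be sketched numerically. This is a minimal illustration of the documented 10s/20s/40s... progression capped at six minutes, not Kubernetes source code:

```python
# Exponential back-off delays between Job pod recreations, as described in
# the Kubernetes docs: 10s, 20s, 40s, ... capped at six minutes (360s).
CAP_SECONDS = 6 * 60

def backoff_delay(retry: int) -> int:
    """Delay (in seconds) applied before retry number `retry` (1-based)."""
    return min(10 * 2 ** (retry - 1), CAP_SECONDS)

delays = [backoff_delay(n) for n in range(1, 7)]
print(delays)  # [10, 20, 40, 80, 160, 320]
```

Note that even for backoffLimit: 6 the cap is never reached; it only matters for higher limits.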

If a new Job is scheduled before the Job controller has had a chance to recreate a pod (keeping in mind the delay after the previous failure), the Job controller starts counting from one again.
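To see why a short schedule can cut the retry sequence off, compare the cumulative back-off time against the cron period. The arithmetic below just sums the documented delays; the 180s period corresponds to a hypothetical `*/3 * * * *` schedule:

```python
# Waits (seconds) before retries 2..6, per the documented back-off schedule.
delays = [10, 20, 40, 80, 160]
SCHEDULE_PERIOD = 180  # a */3 cron schedule fires every 180s

cumulative = 0
for retry, d in enumerate(delays, start=2):
    cumulative += d
    late = " (after the next scheduled run!)" if cumulative > SCHEDULE_PERIOD else ""
    print(f"retry {retry} starts ~{cumulative}s after the first failure{late}")
```

The later retries land after the next CronJob run has already been created, which is when the back-off count can reset and pods go "missing" from the tally.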

I reproduced your issue in GKE using the following .yaml:

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hellocron
spec:
  schedule: "*/3 * * * *" # Runs every 3 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hellocron
            image: busybox
            args:
            - /bin/cat
            - /etc/os
          restartPolicy: Never
      backoffLimit: 6
  suspend: false
```

This job will fail because the file /etc/os doesn't exist.
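The failure is easy to check locally, since the container just runs cat on a missing file. This assumes /etc/os does not exist on your machine, which is the case on standard Linux distributions (the real file is /etc/os-release):

```shell
# Same command the container runs; cat exits non-zero for a missing file,
# which makes the pod terminate as Failed under restartPolicy: Never.
/bin/cat /etc/os
echo "exit code: $?"
```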

And here is the output of kubectl describe for one of the Jobs:

```
Name:           hellocron-1551194280
Namespace:      default
Selector:       controller-uid=b81cdfb8-39d9-11e9-9eb7-42010a9c00d0
Labels:         controller-uid=b81cdfb8-39d9-11e9-9eb7-42010a9c00d0
                job-name=hellocron-1551194280
Annotations:    <none>
Controlled By:  CronJob/hellocron
Parallelism:    1
Completions:    1
Start Time:     Tue, 26 Feb 2019 16:18:07 +0100
Pods Statuses:  0 Running / 0 Succeeded / 6 Failed
Pod Template:
  Labels:  controller-uid=b81cdfb8-39d9-11e9-9eb7-42010a9c00d0
           job-name=hellocron-1551194280
  Containers:
   hellocron:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Args:
      /bin/cat
      /etc/os
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type     Reason                Age   From            Message
  ----     ------                ----  ----            -------
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-4lf6h
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-85khk
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-wrktb
  Normal   SuccessfulCreate      26m   job-controller  Created pod: hellocron-1551194280-6942s
  Normal   SuccessfulCreate      25m   job-controller  Created pod: hellocron-1551194280-662zv
  Normal   SuccessfulCreate      22m   job-controller  Created pod: hellocron-1551194280-6c6rh
  Warning  BackoffLimitExceeded  17m   job-controller  Job has reached the specified backoff limit
```

Note the delay between the creation of pods hellocron-1551194280-662zv and hellocron-1551194280-6c6rh.

answered Sep 17 '22 by MWZ