How to fail a (cron) job after a certain number of retries?

Tags:

kubernetes

We have a Kubernetes cluster of web scraping cron jobs set up. All seems to go well until a cron job starts to fail (e.g., when a site structure changes and our scraper no longer works). Every now and then, a few failing cron jobs keep retrying until they bring down our cluster. Running kubectl get cronjobs (prior to a cluster failure) shows too many jobs running for a failing cron job.

I've attempted following the note described here regarding a known issue with the pod backoff failure policy; however, that does not seem to work.

Here is our config for reference:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: scrape-al
spec:
  schedule: '*/15 * * * *'
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 0
  successfulJobsHistoryLimit: 0
  jobTemplate:
    metadata:
      labels:
        app: scrape
        scrape: al
    spec:
      template:
        spec:
          containers:
            - name: scrape-al
              image: 'govhawk/openstates:1.3.1-beta'
              command:
                - /opt/openstates/openstates/pupa-scrape.sh
              args:
                - al bills --scrape
          restartPolicy: Never
      backoffLimit: 3

Ideally we would prefer that a cron job would be terminated after N retries (e.g., something like kubectl delete cronjob my-cron-job after my-cron-job has failed 5 times). Any ideas or suggestions would be much appreciated. Thanks!


asked Jan 29 '18 16:01

doubleswirve

1 Answer

You can tell your Job to stop retrying using backoffLimit.

Specifies the number of retries before marking this job failed.

In your case

spec:
  template:
    spec:
      containers:
        - name: scrape-al
          image: 'govhawk/openstates:1.3.1-beta'
          command:
            - /opt/openstates/openstates/pupa-scrape.sh
          args:
            - al bills --scrape
      restartPolicy: Never
  backoffLimit: 3

You set backoffLimit to 3 on your Job. That means that when the CronJob creates a Job, the Job is retried up to 3 times if it fails. This setting controls the Job, not the CronJob.

When a Job fails, the CronJob still creates another Job at the next scheduled time.

What you want: if I am not mistaken, you want to stop scheduling new Jobs once your scheduled Jobs have failed 5 times. Right?

Answer: In that case, this is not possible automatically.

Possible solution: you need to suspend the CronJob so that it stops scheduling new Jobs.

suspend: true

You can do this manually. If you do not want to do it manually, you need to set up a watcher that watches your CronJob's status and updates the CronJob to suspend it when necessary.
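For the manual route, here is a minimal sketch of where the two settings sit, reusing the manifest from the question. suspend belongs directly under the CronJob's spec, while backoffLimit stays under jobTemplate.spec because it applies to each Job. The comments are my own annotations, not part of the original config.

```yaml
apiVersion: batch/v1beta1        # same API version as the manifest in the question
kind: CronJob
metadata:
  name: scrape-al
spec:
  schedule: '*/15 * * * *'
  suspend: true                  # CronJob level: no new Jobs are scheduled while this is true
  jobTemplate:
    spec:
      backoffLimit: 3            # Job level: retries for each individual Job
      template:
        spec:
          containers:
            - name: scrape-al
              image: 'govhawk/openstates:1.3.1-beta'
              command:
                - /opt/openstates/openstates/pupa-scrape.sh
              args:
                - al bills --scrape
          restartPolicy: Never
```

Suspending only stops future scheduling. Jobs that the CronJob has already started are not affected, so any that are still running have to finish or be deleted separately. Setting suspend back to false resumes the schedule.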

answered Sep 19 '22 07:09

Shahriar
