Not sure what I am doing wrong, but I am experiencing an issue where CronJobs stop scheduling new Jobs. It seems like this happens only after a couple of failures to launch a new Job. In my specific case, Jobs were not able to start due to an inability to pull the container image.
I'm not really finding any settings that would lead to this, but I'm no expert on Kubernetes CronJobs. Configuration below:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  labels:
    app.kubernetes.io/instance: cron-deal-report
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: cron
    helm.sh/chart: cron-0.1.0
  name: cron-deal-report
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        spec:
          containers:
          - args:
            - -c
            - npm run script
            command:
            - /bin/sh
            env:
            image: <redacted>
            imagePullPolicy: Always
            name: cron
            resources: {}
            securityContext:
              runAsUser: 1000
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
  schedule: 0/15 * * * *
  successfulJobsHistoryLimit: 3
  suspend: false
status: {}
As per Jobs - Run to Completion - Handling Pod and Container Failures:
An entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the node (node is upgraded, rebooted, deleted, etc.), or if a container of the Pod fails and the .spec.template.spec.restartPolicy = "Never". When a Pod fails, then the Job controller starts a new Pod.
You are using restartPolicy: Never for your jobTemplate, so see the next quote on Pod backoff failure policy:
There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. The back-off count is reset if no new failed Pods appear before the Job’s next status check.
The .spec.backoffLimit is not defined in your jobTemplate, so it's using the default (6).
Next, as per Job Termination and Cleanup:
By default, a Job will run uninterrupted unless a Pod fails, at which point the Job defers to the .spec.backoffLimit described above. Another way to terminate a Job is by setting an active deadline. Do this by setting the .spec.activeDeadlineSeconds field of the Job to a number of seconds.
That's your case: If your containers fail to pull the image six consecutive times, your Job will be considered as failed.
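As a minimal sketch of where these two Job-level fields go in the manifest from the question: both sit under spec.jobTemplate.spec, next to template. The values 3 and 600 are illustrative assumptions, not recommendations, and only the relevant part of the manifest is shown.

```yaml
spec:
  jobTemplate:
    spec:
      backoffLimit: 3              # assumed value; the default of 6 applies when omitted
      activeDeadlineSeconds: 600   # assumed value; the Job is terminated after 10 minutes
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: cron
            image: <redacted>
```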
As per Cron Job Limitations:
A cron job creates a job object about once per execution time of its schedule [...]. The Cronjob is only responsible for creating Jobs that match its schedule, and the Job in turn is responsible for the management of the Pods it represents.
This means that all pod/container failures should be handled by the Job Controller (i.e., by adjusting the jobTemplate).
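A sketch of what adjusting the jobTemplate could look like at the Pod level for the manifest in the question. It assumes image tags are never reused for different content, since otherwise IfNotPresent may run a stale image. Only the changed fields and their parents are shown; everything else would stay as posted.

```yaml
spec:
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure          # failed containers are restarted in place, in the same Pod
          containers:
          - name: cron
            image: <redacted>
            imagePullPolicy: IfNotPresent   # pull only when the image is not already on the node
```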
"Retrying" a Job:
You do not need to recreate a CronJob if one of its Jobs fails. You only need to wait for the next scheduled run.
If you want to run a new Job before the next schedule, you can use the Cronjob template to create a Job manually with:
kubectl create job --from=cronjob/my-cronjob-name my-manually-job-name
If your containers are constantly unable to download the images, you have the following options:
- Set backoffLimit to a higher value.
- Use restartPolicy: OnFailure for your containers, so the Pod will stay on the node, and only the container will be re-run.
- Use imagePullPolicy: IfNotPresent. If you are not retagging your images, there is no need to force a re-pull for every job start.
Just to expand on Eduardo Baitello's answer, I would also like to mention 2 more caveats:
Eduardo mentioned Cron Job Limitations, but didn't expand on the Too many missed start time (> 100) issue. For this, the only reliable solution I've found is to delete the CronJob and recreate it. Alternatively, you can patch the CronJob to decrease its frequency, which tricks the scheduler into running it again, and then patch it back to how it was, but this is trickier. If the CronJob has been affected, kubectl describe cronjob CRONJOB_NAME should list this as one of its events. It usually affects CronJobs that have a high frequency.
If you have a lot of CronJobs/Jobs, then you could be experiencing this bug (#77465), which has been fixed in 1.14.7. This occurs if you have more than 500 Jobs within the entire cluster. This one is harder to find, but you can query the kube-scheduler logs for expected type *batchv1.JobList, got type *internalversion.List.
You can print the logs for kube-scheduler using the following command:
kubectl -n kube-system logs -l component=kube-scheduler --tail 100
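Coming back to the first caveat: the Kubernetes CronJob documentation describes .spec.startingDeadlineSeconds as bounding the window over which missed start times are counted. The fragment below is a sketch with an assumed value of 200 seconds. With the 15-minute schedule from the question, at most one start time falls inside that window, so the count stays far below 100. Whether this avoids the delete-and-recreate step on a given cluster version is an assumption.

```yaml
spec:
  schedule: 0/15 * * * *
  startingDeadlineSeconds: 200   # assumed value; start times missed longer ago than this are not counted
```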