I have batch jobs that I want to run on Kubernetes. The way I understand Jobs: if I choose restartPolicy: Never, then when the Pod fails, the Job replaces it with a new Pod scheduled onto (potentially) another node. If restartPolicy: OnFailure, the container is restarted in the existing Pod. I'd consider a certain number of failures unrecoverable. Is there a way I can prevent the Job from rescheduling or restarting after a certain period of time, and clean up the unrecoverable Jobs?

My current thought for a workaround is to have some watchdog process that looks at retryTimes and cleans up Jobs after a specified number of retries.
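For concreteness, here is a minimal sketch of the kind of Job spec I mean (the name and image are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-batch-job            # placeholder name
spec:
  template:
    spec:
      containers:
        - name: worker                   # placeholder container name
          image: example.com/worker:1.0  # placeholder image
      # Never: a failed Pod is replaced by a new Pod, possibly on another node.
      # OnFailure: the container is restarted inside the existing Pod.
      restartPolicy: Never
```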
Summary of Slack discussion:

No, there is no retry limit. However, you can set a deadline on the job as of v1.2 with activeDeadlineSeconds. The system should back off restarts and then terminate the job when it hits the deadline.
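For example, something along these lines (a sketch; the name, image, and the 600-second deadline are arbitrary placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: deadline-example             # placeholder name
spec:
  activeDeadlineSeconds: 600         # terminate the Job once it has been active for 10 minutes
  template:
    spec:
      containers:
        - name: worker                   # placeholder container name
          image: example.com/worker:1.0  # placeholder image
      restartPolicy: OnFailure
```

Once the deadline is reached, the Job's running Pods are terminated and the Job is marked failed with reason DeadlineExceeded.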
FYI, this has now been added as .spec.backoffLimit.
https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
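A sketch of what that looks like (the limit of 4, name, and image are placeholder values; backoffLimit defaults to 6 if unset):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-example              # placeholder name
spec:
  backoffLimit: 4                    # mark the Job as failed after 4 retries
  activeDeadlineSeconds: 600         # optionally also cap total runtime
  template:
    spec:
      containers:
        - name: worker                   # placeholder container name
          image: example.com/worker:1.0  # placeholder image
      restartPolicy: Never
```

With restartPolicy: Never, failed Pods count toward the limit; with restartPolicy: OnFailure, container restarts are counted as well.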