Is there a way to monitor a Kubernetes CronJob?
I have a Kubernetes CronJob that runs every 10 minutes on my cluster. Is there a way to collect metrics every time the CronJob fails due to an error, or to get notified when it has not completed after a certain period of time?
System cron monitoring: connect the VMs to the Prometheus Pod. On the VMs, configure node exporters to expose system cron status metrics for Prometheus to scrape. In Prometheus, use these metrics to set up alerting rules. In a Grafana dashboard, display the latest status of each OpenShift CronJob and each system cron.
A CronJob creates Jobs on a repeating schedule. One CronJob object is like one line of a crontab (cron table) file. It runs a job periodically on a given schedule, written in Cron format.
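As a small illustration of the cron format, the 10-minute schedule from the question could be written as in the fragment below. This is only a sketch of the `schedule` field, not a complete manifest.

```yaml
# Fragment of a CronJob spec (illustrative only).
# The five cron fields are: minute, hour, day of month, month, day of week.
spec:
  schedule: "*/10 * * * *"   # every 10 minutes
```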
I'm using these rules with kube-state-metrics:
groups:
- name: job.rules
  rules:
  - alert: CronJobRunning
    expr: time() - kube_cronjob_next_schedule_time > 3600
    for: 1h
    labels:
      severity: warning
    annotations:
      description: CronJob {{$labels.namespace}}/{{$labels.cronjob}} is taking more than 1h to complete
      summary: CronJob didn't finish after 1h
  - alert: JobCompletion
    expr: kube_job_spec_completions - kube_job_status_succeeded > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespace}}/{{$labels.job}} is taking more than 1h to complete
      summary: Job {{$labels.job}} didn't finish after 1h
  - alert: JobFailed
    expr: kube_job_status_failed > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespace}}/{{$labels.job}} failed to complete
      summary: Job failed
The tricky part here is that CronJobs themselves have no useful status; you have to match them to the Jobs they create. I've written up an article on how to achieve this:
https://medium.com/@tristan_96324/prometheus-k8s-cronjob-alerts-94bee7b90511
The article goes into a bit of detail on how things work, but the alert config is as follows:
groups:
- name: kube-cron
  rules:
  - record: job_cronjob:kube_job_status_start_time:max
    expr: |
      label_replace(
        label_replace(
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (exported_job, label_cronjob)
          == ON(label_cronjob) GROUP_LEFT()
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (label_cronjob),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")
  - record: job_cronjob:kube_job_status_failed:sum
    expr: |
      clamp_max(
        job_cronjob:kube_job_status_start_time:max,
        1)
      * ON(job) GROUP_LEFT()
      label_replace(
        label_replace(
          (kube_job_status_failed != 0),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")
  - alert: CronJobStatusFailed
    expr: |
      job_cronjob:kube_job_status_failed:sum
      * ON(cronjob) GROUP_RIGHT()
      kube_cronjob_labels
      > 0
    for: 1m
    annotations:
      description: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'
The jobTemplate must include a label called cronjob that matches the name of the CronJob object.
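The labeling requirement above could look roughly like the manifest below. This is a minimal sketch: the name `my-cronjob`, the `busybox` image, and the command are hypothetical placeholders, and the 10-minute schedule is taken from the question.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cronjob                # hypothetical name
spec:
  schedule: "*/10 * * * *"        # every 10 minutes, as in the question
  jobTemplate:
    metadata:
      labels:
        cronjob: my-cronjob       # must match metadata.name above
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: task            # placeholder container
            image: busybox:1.36
            command: ["sh", "-c", "echo running"]
```

Every Job created from this template carries the `cronjob` label, which the rules above expect to see as `label_cronjob` on `kube_job_labels`. Depending on the kube-state-metrics version, Kubernetes labels may need to be explicitly allow-listed before they are exported, so it is worth checking that the label actually appears in the metric.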