Is there a way to monitor Kubernetes CronJobs using Prometheus?

Is there a way to monitor a Kubernetes CronJob?

I have a CronJob that runs every 10 minutes on my cluster. Is there a way to collect metrics every time the CronJob fails with an error, or to be notified when it has not completed after a certain period of time?

asked Nov 17 '17 05:11

user3587892

2 Answers

I'm using these rules with kube-state-metrics:

groups:
- name: job.rules
  rules:
  - alert: CronJobRunning
    expr: time() - kube_cronjob_next_schedule_time > 3600
    for: 1h
    labels:
      severity: warning
    annotations:
      description: CronJob {{$labels.namespace}}/{{$labels.cronjob}} is taking more than 1h to complete
      summary: CronJob didn't finish after 1h

  - alert: JobCompletion
    expr: kube_job_spec_completions - kube_job_status_succeeded  > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespace}}/{{$labels.job}} is taking more than 1h to complete
      summary: Job {{$labels.job}} didn't complete after 1h

  - alert: JobFailed
    expr: kube_job_status_failed  > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespace}}/{{$labels.job}} failed to complete
      summary: Job failed

answered Oct 15 '22 07:10

Camil

The tricky part here is that the CronJobs themselves have no useful status; you have to match them to the Jobs they create. I've written an article on how to achieve this:

https://medium.com/@tristan_96324/prometheus-k8s-cronjob-alerts-94bee7b90511

The article goes into a bit of detail about how things work, but the alert config is as follows:

groups:
- name: kube-cron
  rules:
  - record: job_cronjob:kube_job_status_start_time:max
    expr: |
      label_replace(
        label_replace(
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (exported_job, label_cronjob)
          == ON(label_cronjob) GROUP_LEFT()
          max(
            kube_job_status_start_time
            * ON(exported_job) GROUP_RIGHT()
            kube_job_labels{label_cronjob!=""}
          ) BY (label_cronjob),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")

  - record: job_cronjob:kube_job_status_failed:sum
    expr: |
      clamp_max(
        job_cronjob:kube_job_status_start_time:max,
      1)
      * ON(job) GROUP_LEFT()
      label_replace(
        label_replace(
          (kube_job_status_failed != 0),
          "job", "$1", "exported_job", "(.+)"),
        "cronjob", "$1", "label_cronjob", "(.+)")


  - alert: CronJobStatusFailed
    expr: |
      job_cronjob:kube_job_status_failed:sum
      * ON(cronjob) GROUP_RIGHT()
      kube_cronjob_labels
      > 0
    for: 1m
    annotations:
      description: '{{ $labels.cronjob }} last run has failed {{$value }} times.'

The jobTemplate must include a label called cronjob that matches the name of the cronjob object.
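
As an illustration of that labelling requirement, here is a minimal sketch of a CronJob manifest. The name my-cronjob, the busybox image, and the command are hypothetical placeholders and not part of the original answer; the schedule mirrors the 10-minute interval from the question. The only part the rules above rely on is the cronjob label under jobTemplate.metadata.labels.

```yaml
# Sketch only: names, image, and command are placeholders.
# batch/v1 assumes Kubernetes 1.21 or later; older clusters used batch/v1beta1.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cronjob
spec:
  schedule: "*/10 * * * *"        # every 10 minutes, as in the question
  jobTemplate:
    metadata:
      labels:
        cronjob: my-cronjob       # must equal metadata.name of the CronJob
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: busybox
            command: ["sh", "-c", "echo running"]
```

Each Job created from this template inherits the cronjob label, which kube-state-metrics can then expose as label_cronjob on kube_job_labels. Depending on the kube-state-metrics version, the label may also need to be allowed explicitly (in v2 this is likely done through the --metric-labels-allowlist flag, for example jobs=[cronjob]) before it shows up in the metric.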

answered Oct 15 '22 06:10

Tristan Colgate
