There are several metrics collected for cron jobs; unfortunately, I'm not sure how to use them properly.
I wanted to use the kube_job_status_failed == 1 metric. I can use a regex like job=~".+myjobname.+" to aggregate all failed attempts for a cron job.
This is where I got stuck. Is there a way to count the number of distinct label sets (= number of failed attempts) in a given time period?
Or can I use the metrics the other way around, i.e. check whether there was a kube_job_status_succeeded{job=~".+myjobname.+"} == 1 in a given time period?
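To make it concrete, these are rough sketches of the two ideas (the label selector and the 1h window are just placeholders from my setup, and I'm not sure either of them is right):

count(count by (job) (kube_job_status_failed{job=~".+myjobname.+"} == 1))

max_over_time(kube_job_status_succeeded{job=~".+myjobname.+"}[1h]) == 1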
I feel like I’m so close to solving this but I just can’t wrap my head around it.
EDIT: Added a picture. It shows that there clearly are several succeeded jobs over time; I just have no clue how to count them.
Alright people, here is a somewhat gross way to do this that you can generalize for gauges that you only want to count the initial value of:
Step 1: Make it so you can count the gauge value just once (effectively counting the number of distinct label sets):
sum(kube_job_failed{condition="true"} unless kube_job_failed offset 1m)
What you will see with this query is a spike for each job failure at the moment it happens, which doesn't persist afterwards.
This is assuming a scrape interval of 1m. If you scrape kube-state-metrics every 30s, this will double-count some failures, and you should use offset 30s instead. The way this works is that unless performs a left anti-join, removing every series that also existed one scrape interval earlier. That lets you count each series just once, the first time it is scraped.
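If you want this broken out per Job instead of a single total, something like the following should work (I'm assuming your kube-state-metrics version exposes a job_name label; adjust to whatever label your setup actually has):

sum by (job_name) (kube_job_failed{condition="true"} unless kube_job_failed offset 1m)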
Step 2:
sum_over_time(sum(kube_job_failed{condition="true"} unless kube_job_failed offset 1m)[1h:])
This is going to sum the previous query, as a subquery, over the time range you give it - in this case, the past 1 hour.
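And the same thing broken out per Job, with the subquery resolution pinned explicitly (again assuming a 1m scrape interval and a job_name label; adjust both to your setup):

sum_over_time(sum by (job_name) (kube_job_failed{condition="true"} unless kube_job_failed offset 1m)[1h:1m])

Graphed or used in an alert, that should give you the number of distinct failure events per Job over the past hour.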