Prometheus: Alert on change in value

Tags:

prometheus

I want to be alerted if log_error_count has incremented by at least 1 in the past minute.

So originally my query looked like

ALERT BackendErrors
  IF rate(log_error_count[1m]) > 0
  FOR 1s
  ...

But then I tried to sanity-check the graph using the Prometheus dashboard.

Using the query

log_error_count

My graph looks like this:

[graph of log_error_count]

When I look at the graph with the query

rate(log_error_count[2m])

My graph looks like this:

[graph of rate(log_error_count[2m])]

In fact, I've also tried the functions irate, changes, and delta, and they all come out as zero.

Why is the rate zero, and what does my query need to look like so that I can alert when the counter has been incremented even once?

asked by math4tots on May 09 '17

2 Answers

I had a similar issue with planetlabs/draino: I wanted to be able to detect when it drained a node. (Unfortunately, they carry their minimalist logging policy, which makes sense for logging, over to metrics, where it doesn't make sense...) The draino_pod_ip:10002/metrics endpoint's page is completely empty: the metric does not exist until the first drain occurs.

My needs were slightly harder to detect: I had to deal with the metric not existing when the value is 0 (i.e. after a pod reboot). So I had to detect both the transition from "does not exist" -> 1 and from n -> n+1.

This is what I came up with. Note that the metric I was detecting is an integer; I'm not sure how this will work with decimals. Even if it needs tweaking for your needs, I think it may help point you in the right direction:


(absent(draino_cordoned_nodes_total offset 1m) == 1 and count(draino_cordoned_nodes_total) > -1)

^ creates a blip of 1 when the metric switches from does not exist to exists

((draino_cordoned_nodes_total - draino_cordoned_nodes_total offset 1m) > 0)

^ creates a blip of 1 when it increases from n -> n+1


Combining the 2:

(absent(draino_cordoned_nodes_total offset 1m) == 1 and count(draino_cordoned_nodes_total) > -1) or ((draino_cordoned_nodes_total - draino_cordoned_nodes_total offset 1m) > 0)

^ or'ing them both together allowed me to detect changes as a single blip of 1 on a Grafana graph, which I think is what you're after.
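
For completeness, here's a rough sketch of how that combined expression could be dropped into a Prometheus 2.x rules file. The group name, alert name, labels, and annotation below are just illustrative placeholders, not anything draino ships:

groups:
  - name: draino.rules
    rules:
      - alert: DrainoCordonedNode
        # fires when the counter first appears, or when it increases
        expr: |
          (absent(draino_cordoned_nodes_total offset 1m) == 1 and count(draino_cordoned_nodes_total) > -1)
          or
          ((draino_cordoned_nodes_total - draino_cordoned_nodes_total offset 1m) > 0)
        labels:
          severity: warning
        annotations:
          summary: "draino cordoned/drained a node in the last minute"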

answered by neokyle on Sep 23 '22

@neokyle has a great solution depending on the metrics you're using.

In my case I needed to solve a similar problem. The issue was that I also have labels that need to be included in the alert, and it was not practical to use absent, since that would mean generating a separate alert for every label combination. (I'm using Jsonnet, so this would be feasible, but still quite annoying!)

The key in my case was to use unless, which is the complement operator. I wrote something that looks like this:

(my_metric unless my_metric offset 15m) > 0

This will result in a series after a metric goes from absent to non-absent, while also keeping all labels. The series lasts for as long as the offset, so this would create a 15m blip. It's not super intuitive, but my understanding is that unless matches when the series themselves are different, so this won't trigger when only the value changes, for instance.
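
Applied to the metric from the question, that would be something along these lines (15m is just the window I happened to use; shrink it to 1m or whatever fits your case):

(log_error_count unless log_error_count offset 15m) > 0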

You could then add an or clause for (increase / delta) > 0, depending on what you're working with. This is a bit messy, but to give an example:

(
  my_metric
  unless my_metric offset 15m
) > 0
or
(
  delta(
    my_metric[15m]
  )
) > 0
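
To tie it back to the question, here's a rough sketch of what this pattern could look like as a full Prometheus 2.x alerting rule for log_error_count. The group name, window, labels, and annotation are illustrative, and I've used increase rather than delta since log_error_count is a counter:

groups:
  - name: backend.rules
    rules:
      - alert: BackendErrors
        # first branch catches the metric appearing for the first time,
        # second branch catches any subsequent increment
        expr: |
          ((log_error_count unless log_error_count offset 15m) > 0)
          or
          (increase(log_error_count[15m]) > 0)
        labels:
          severity: warning
        annotations:
          summary: "log_error_count appeared or increased in the last 15 minutes"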
answered by Jacob Colvin on Sep 22 '22