I want to be alerted if log_error_count has incremented by at least 1 in the past minute.
So originally my alert rule looked like this:
ALERT BackendErrors
IF rate(log_error_count[1m]) > 0
FOR 1s
...
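For reference, the same rule in the newer Prometheus 2.x YAML rule-file syntax would be roughly the following (the group name here is arbitrary, just for illustration):
groups:
  - name: backend-errors
    rules:
      - alert: BackendErrors
        expr: rate(log_error_count[1m]) > 0
        for: 1s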
But then I tried to sanity-check the graph using the Prometheus dashboard.
Using the query
log_error_count
my graph looks like this: [graph screenshot]
When I look at the graph with the query
rate(log_error_count[2m])
my graph looks like this: [graph screenshot, the rate is flat at zero]
In fact, I've also tried the functions irate, changes, and delta, and they all return zero.
Why is the rate zero, and what does my query need to look like for me to be able to alert when a counter has been incremented even once?
I had a similar issue with planetlabs/draino:
I wanted to be able to detect when it drained a node.
(Unfortunately, they carry their minimalist logging policy, which makes sense for logging, over to metrics, where it doesn't make sense...)
The draino_pod_ip:10002/metrics endpoint's page is completely empty... the metric does not exist until the first drain occurs...
My needs were slightly more difficult to detect: I had to deal with the metric not existing when the value = 0 (i.e. on pod reboot).
I had to detect the transition from does not exist -> 1, and from n -> n+1.
This is what I came up with. Note that the metric I was detecting is an integer; I'm not sure how this will work with decimals. Even if it needs tweaking for your needs, I think it may help point you in the right direction:
(absent(draino_cordoned_nodes_total offset 1m) == 1 and count(draino_cordoned_nodes_total) > -1)
^ creates a blip of 1 when the metric switches from does not exist to exists
((draino_cordoned_nodes_total - draino_cordoned_nodes_total offset 1m) > 0)
^ creates a blip of 1 when it increases from n -> n+1
Combining the 2:
(absent(draino_cordoned_nodes_total offset 1m) == 1 and count(draino_cordoned_nodes_total) > -1) or ((draino_cordoned_nodes_total - draino_cordoned_nodes_total offset 1m) > 0)
^ OR'ing them both together allowed me to detect changes as a single blip of 1 on a Grafana graph, which I think is what you're after.
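Applying the same pattern to the metric from the question (an untested sketch, just substituting log_error_count for my metric) would look something like:
(absent(log_error_count offset 1m) == 1 and count(log_error_count) > -1) or ((log_error_count - log_error_count offset 1m) > 0)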
@neokyle has a great solution depending on the metrics you're using.
In my case I needed to solve a similar problem. The issue was that I also have labels that need to be included in the alert, and it was not desirable to use absent, as that would mean generating an alert for every label. (I'm using Jsonnet, so this would be feasible, but still quite annoying!)
The key in my case was to use unless, which is the complement operator. I wrote something that looks like this:
(my_metric unless my_metric offset 15m) > 0
This will result in a series after a metric goes from absent to non-absent, while also keeping all labels. The series will last for as long as offset is, so this would create a 15m blip. It's not super intuitive, but my understanding is that it's true when the series themselves are different. So this won't trigger when the value changes, for instance.
You could move on to adding or for (increase / delta) > 0, depending on what you're working with. This is a bit messy, but to give an example:
(
  my_metric
  unless my_metric offset 15m
) > 0
or
(
  delta(my_metric[15m])
) > 0
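For a counter like log_error_count in the question, a rough, untested sketch of the same idea using increase instead of delta might look like:
(
  log_error_count
  unless log_error_count offset 15m
) > 0
or
(
  increase(log_error_count[15m])
) > 0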