I have a counter metric, error_in_execution. Whenever an error occurs, counter.inc() is called.
I have the following alert expression, which should fire when the counter increases.
expr: increase(error_in_execution[5m]) > 0
for: 5m
The issue is this: when the metric does not exist yet and an error occurs for the first time, the counter is created with the value 1. The alert expression does not detect this, so the alert does not fire. Only when the counter increases to 2 does the alert fire.
The following example should make this easier to understand.
Time 0:
Prometheus: error_in_execution --> No metric exists.
Alert: increase(error_in_execution[5m]) > 0 --> Not triggered
Time 1: Error occurs [error_in_execution.inc()]
Prometheus: error_in_execution --> 1
Alert: increase(error_in_execution[5m]) > 0 --> Still not triggered <<<<<< It should be triggered. (Please help here.)
Time 2: Error occurs [error_in_execution.inc()]
Prometheus: error_in_execution --> 2
Alert: increase(error_in_execution[5m]) > 0 --> Alert triggered.
I think I found a workaround for this.
For counters that existed before t, increase(_metric_[t]) is roughly equivalent to _metric_ - _metric_ offset t. (Not exactly, since increase() extrapolates and accounts for counter resets, but that is a different issue.)
For counters that did not exist before t, the increase is simply the metric's value: _metric_ - 0 = _metric_.
We can find out whether a metric existed at point t by querying _metric_ offset t, and we can use that as a WHERE NOT EXISTS filter via the unless operator.
Putting it together, we get the following query:
( _metric_ unless _metric_ offset 1d ) or ( _metric_ - _metric_ offset 1d )
^-----------new counters-------------^    ^--------existing counters------^
As an example, suppose one event happens in each timeframe and we want to measure the increase over 2 timeframes.
Expected:
- none for each query frame before the first occurrence
- one for the query frame on first occurrence
- 2 for each query frame beyond the first occurrence
                         t0  t1  t2  t3  t4  t5
_metric_                 -   -   1   2   3   4
_metric_ offset 2t       -   -   -   -   1   2
__ unless __ offset 2t   -   -   1   2   -   -
__ <minus> __ offset 2t  -   -   -   -   2   2
===============================================
() or ()                 -   -   1   2   2   2
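The table above can be reproduced with a small toy model. This is only a sketch in Python, not how Prometheus evaluates queries: each series is assumed to be a dict from time index to value, a missing key stands for "no sample", and the helper names (offset, unless, minus, either, increase_with_new) are made up for illustration.

```python
# Toy model: a series is {time_index: value}; a missing key means "no sample".

def offset(series, k):
    """Mimic `metric offset k`: at time t, return the value from time t - k."""
    return {t + k: v for t, v in series.items()}

def unless(lhs, rhs):
    """Keep lhs samples only at times where rhs has no sample."""
    return {t: v for t, v in lhs.items() if t not in rhs}

def minus(lhs, rhs):
    """Binary minus: only defined where both sides have a sample."""
    return {t: lhs[t] - rhs[t] for t in lhs if t in rhs}

def either(lhs, rhs):
    """Mimic `or`: take lhs where present, otherwise fall back to rhs."""
    return {**rhs, **lhs}

def increase_with_new(series, k):
    """( metric unless metric offset k ) or ( metric - metric offset k )"""
    shifted = offset(series, k)
    return either(unless(series, shifted), minus(series, shifted))

# The counter first appears at t2 and grows by one per timeframe.
metric = {2: 1, 3: 2, 4: 3, 5: 4}
result = increase_with_new(metric, 2)

# Render the last row of the table for t0..t5.
row = [result.get(t, "-") for t in range(6)]
print(row)
```

The printed row matches the "() or ()" line of the table: no value before the first occurrence, the raw counter value while the series is newer than the window, and the difference afterwards.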
Grafana example graph
Here, total is the raw counter value and increase is the result of the query. The result is still split into two series because the - operator drops the metric name, while unless does not. Summing them up again works well, and is something you will probably do anyway.
Grafana graph with sum
It's really a shame Prometheus makes this so hard for everyone who does not use it just to display CPU temperature. This is one of the instances where my pride in having found a solution is only surpassed by my exasperation that it was necessary in the first place.
This is "normal" behaviour. If the metric did not exist before and first appears with the value 1, functions like increase() or rate() do not count that first value as an increase.
To catch the very first error, make sure the metric exists from the moment your application starts, with an initial value of 0. Then the first increment will trigger your alert.
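To see why the initial 0 matters, here is a deliberately simplified sketch in Python. It assumes increase() can be approximated as "last sample minus first sample in the window"; the real function also extrapolates and handles counter resets, so actual values may differ. The function simple_increase and the sample lists are hypothetical.

```python
def simple_increase(samples):
    """Approximate increase over a window of scraped samples.

    Simplification (assumption): last value minus first value, with no
    extrapolation or counter-reset handling. Returns None when fewer
    than two samples exist, since no increase can be computed.
    """
    if len(samples) < 2:
        return None
    return samples[-1] - samples[0]

# Counter is only exported once the first error happens: first sample is 1.
lazy_samples = [1, 1, 1]

# Counter is exported as 0 at startup, then incremented on the first error.
eager_samples = [0, 0, 1, 1]

lazy_increase = simple_increase(lazy_samples)
eager_increase = simple_increase(eager_samples)

# The alert condition from the question: increase(...) > 0
lazy_fires = lazy_increase is not None and lazy_increase > 0
eager_fires = eager_increase is not None and eager_increase > 0
print(lazy_fires, eager_fires)
```

In this model the series that starts at 1 never shows an increase, while the series that starts at 0 does, which is the difference the alert depends on.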