How to gracefully avoid divide by zero in Prometheus

Tags: prometheus

There are times when you need to divide one metric by another metric.

For example, I'd like to calculate a mean latency like this:

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
rate({__name__="hystrix_command_latency_total_seconds_count"}[60s])

If there is no activity during the specified time period, the rate() in the divider becomes 0 and the result of division becomes NaN. If I do some aggregation over the results (avg() or sum() or whatever), the whole aggregation result becomes NaN.

So I add a check for zero in the divider:

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
(rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > 0)

This removes NaNs from the result vector, but it also tears the line on the graph to shreds.

Let's mark periods of inactivity with a 0 value to make the graph continuous again:

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
(rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > 0)
or
rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > bool 0

This effectively replaces NaNs with 0: the graph is continuous and aggregations work OK.

But the resulting query is slightly cumbersome, especially when you need more label filtering and some aggregation over the results. Something like this:
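Annotated, the query above can be read as follows. This is only a commented restatement of the same expression (PromQL comments start with #), with the metric selectors written in their short form:

```promql
# Left operand: mean latency, computed only where the request rate is positive.
# "> 0" acts as a filter: series whose rate is 0 are dropped, so 0/0 never happens.
  rate(hystrix_command_latency_total_seconds_sum[60s])
/
  (rate(hystrix_command_latency_total_seconds_count[60s]) > 0)
# "or" keeps every left-hand series and adds right-hand series
# whose label set is missing on the left (the idle ones).
or
# "> bool 0" does not filter: it returns 1 where the rate is positive
# and 0 where it is not, so idle series contribute the value 0.
  rate(hystrix_command_latency_total_seconds_count[60s]) > bool 0
```

Because "or" binds more loosely than "/" and the comparison operators, no extra parentheses are needed around the two halves.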

avg(
    1000 * increase({__name__=~".*_hystrix_command_latency_total_seconds_sum", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s])
    /
    (increase({__name__=~".*_hystrix_command_latency_total_seconds_count", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s]) > 0)
    or
    increase({__name__=~".*_hystrix_command_latency_total_seconds_count", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s]) > bool 0
) by (command_group, command_name)

Long story short: are there any simpler ways to deal with zeros in the divider? Or any common practices?

Yoory N. asked Nov 01 '17 13:11


2 Answers

"If there is no activity during the specified time period, the rate() in the divider becomes 0 and the result of division becomes NaN."

This is the correct behaviour; NaN is what you want the result to be.

"aggregations work OK."

You can't aggregate ratios: averaging per-series ratios gives a nearly idle series the same weight as a busy one. You need to aggregate the numerator and denominator separately and then divide.

So:

  sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_sum[5m]))
/
  sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_count[5m]))
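
As a sketch of how this might be adapted to the dashboard query from the question: the version below adds the 1000x scaling to milliseconds and an optional "> 0" guard on the aggregated denominator. The guard is an assumption added for illustration, not part of the answer above. It drops a group that had no requests in the window instead of returning NaN for it, so the graph shows a gap for that group.

```promql
# Mean latency in milliseconds per command, aggregated before dividing.
1000 *
  sum by (command_group, command_name) (rate(hystrix_command_latency_total_seconds_sum[5m]))
/
  # Optional guard: keep only groups with a positive request rate.
  (sum by (command_group, command_name) (rate(hystrix_command_latency_total_seconds_count[5m])) > 0)
```

Label filters such as job=~"$service" would go inside both selectors, identically on each side, so that the two sums cover the same series.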
brian-brazil answered Sep 20 '22 13:09


Finally I have a solution for my specific problem:

A division by zero leads to a NaN display. That is technically correct, but it is not what the user wants to see (it does not fulfil the business requirement).

So I searched a bit and found a "solution" for my problem in the Grafana community:

Surround your problematic value with max(YOUR_PROBLEMATIC_QUERY or vector(-1)). An additional value mapping then leads to a useful output.

(Of course you have to adapt the solution to your problem... min/max... vector(42)/vector(101)/vector(...))

Update (1)

It turns out to be a bit trickier, depending on the query. For example, I have another query that also returns NaN as the result of a division by zero, and the solution above does not work for it. There I had to wrap the query in parentheses and append > 0 or on() vector(100).
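
A minimal sketch of that second form. The answer does not show its query, so the latency metrics from the question are used here as a stand-in, and the fallback value 100 is taken from the text above:

```promql
(
    sum(rate(hystrix_command_latency_total_seconds_sum[5m]))
  /
    sum(rate(hystrix_command_latency_total_seconds_count[5m]))
) > 0                  # comparisons with NaN are false, so a NaN result is filtered out
or on() vector(100)    # used only when the left-hand side returns nothing
```

Note that "> 0" also removes legitimate results that are zero or negative, which then get replaced by the fallback as well, so the fallback should be a value that cannot be confused with real data, or one that a value mapping turns into text.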

eventhorizon answered Sep 20 '22 13:09