How to gracefully avoid divide by zero in Prometheus

Tags:

prometheus

There are times when you need to divide one metric by another metric.

For example, I'd like to calculate a mean latency like this:

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
rate({__name__="hystrix_command_latency_total_seconds_count"}[60s])

If there is no activity during the specified time period, the rate() in the denominator becomes 0 and the result of the division becomes NaN. If I then aggregate over the results (avg(), sum() or whatever), the whole aggregation result becomes NaN.
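To see the arithmetic in isolation, here is a minimal sketch that uses constant vectors instead of real metrics (the constants are made up for illustration). PromQL arithmetic follows IEEE 754 floating-point rules, so it is specifically 0/0 that yields NaN. With no requests, both the _sum and the _count rates are 0, which is exactly that case:

```promql
# Each line is a separate query.
vector(0) / vector(0)   # NaN
vector(1) / vector(0)   # +Inf
```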

So I add a check for a zero denominator:

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
(rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > 0)

This removes the NaNs from the result vector, but it also tears the line on the graph to shreds.

Let's mark periods of inactivity with a value of 0 to make the graph continuous again:

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
(rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > 0)
or
rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > bool 0

This effectively replaces the NaNs with 0: the graph is continuous and aggregations work OK.

But the resulting query is rather cumbersome, especially when you need more label filtering and some aggregation over the results. Something like this:

avg(
    1000 * increase({__name__=~".*_hystrix_command_latency_total_seconds_sum", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s])
    /
    (increase({__name__=~".*_hystrix_command_latency_total_seconds_count", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s]) > 0)
    or
    increase({__name__=~".*_hystrix_command_latency_total_seconds_count", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s]) > bool 0
) by (command_group, command_name)

Long story short: is there a simpler way to deal with zeros in the denominator? Or any common practice?

304

asked Nov 01 '17 13:11

Yoory N.

2 Answers

"If there is no activity during the specified time period, the rate() in the denominator becomes 0 and the result of the division becomes NaN."

This is the correct behaviour; NaN is what you want the result to be.

"aggregations work OK."

You can't aggregate ratios: averaging per-series ratios gives every series the same weight, no matter how many requests it handled. You need to aggregate the numerator and the denominator separately and then divide.

So:

   sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_sum[5m]))
  /
   sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_count[5m]))
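If NaN or gaps are still unwanted for display, the zero filter from the question can be applied to the aggregated denominator instead of to each series. The following is only a sketch, assuming the same metric names and 5m window as above. Note that it deliberately departs from the advice to keep NaN:

```promql
# Ratio of sums, computed only for groups that saw traffic...
  sum by (command_group, command_name) (rate(hystrix_command_latency_total_seconds_sum[5m]))
/
  (sum by (command_group, command_name) (rate(hystrix_command_latency_total_seconds_count[5m])) > 0)
or
# ...and 0 for groups without traffic ("> bool 0" evaluates to 0 there).
  sum by (command_group, command_name) (rate(hystrix_command_latency_total_seconds_count[5m])) > bool 0
```

Because "or" binds more loosely than "/" and ">", the expression reads as (ratio) or (fallback). The fallback contributes a sample only for label sets that are missing from the left-hand side.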

131

answered Sep 20 '22 13:09

brian-brazil

Finally I have a solution for my specific problem:

A division by zero leads to NaN being displayed. That is technically correct, but it is not what the user wants to see (it does not fulfil the business requirement).

So I searched a bit and found a "solution" for my problem in the Grafana community:

Surround your problematic query with max(YOUR_PROBLEMATIC_QUERY or vector(-1)). An additional value mapping in Grafana then leads to a useful output.

(Of course you have to adapt the solution to your problem... min/max... vector(42)/vector(101)/vector(...))

Update (1)

Okay, however: it turns out to be a bit trickier, depending on the query. For example, I have another query that also returns NaN because of a division by zero, and the solution above does not work for it. I had to wrap the query in parentheses and append > 0 or on() vector(100).
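Spelled out, that second form might look like the sketch below. The inner query is a hypothetical stand-in built from the metric names in the question (the actual query is not shown here), and 100 is just the placeholder value mentioned above:

```promql
(
    sum(rate(hystrix_command_latency_total_seconds_sum[5m]))
  /
    sum(rate(hystrix_command_latency_total_seconds_count[5m]))
) > 0
or on() vector(100)
```

NaN > 0 is false, so the comparison drops the NaN sample, and the "or" then supplies the fallback. The on() makes the fallback appear only when the left-hand side returns nothing at all, even if its series carry labels. Be aware that a genuine result of 0 is filtered out and replaced as well.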

answered Sep 20 '22 13:09

eventhorizon
