I am new to Prometheus and wrote the query below to display the average uptime of a website as a percentage for SLA monitoring (Google, for example).
(avg_over_time(probe_success{instance="https://www.google.com/"}[$__range])) * 100
However, is it possible to make the calculation ignore any downtime shorter than 1 minute?
A good way to approach SLAs for probes is with a quantile function, such as:
quantile_over_time(0.99, probe_success{instance="https://www.google.com/"}[$__range])
This is not the exact query you need, but it helps to start from the basics with quantiles in mind.
That said, to answer the question directly and ignore downtimes of under 1 minute, this can help:
avg_over_time(((avg_over_time(probe_success{instance="https://www.google.com/"}[75s]) * 75) > bool(60))[$__range:]) * 100
Let's dissect this query:
avg_over_time(probe_success{instance="https://www.google.com/"}[75s]) gets the average of the probe over the last 75s, as a fraction between 0 and 1, so we can try to ignore 1m downtimes. Call this UP_TIME_PERCENTAGE.
UP_TIME_PERCENTAGE * 75 provides the uptime in seconds over the past 75s. Call this UP_TIME_75S.
UP_TIME_75S > bool(60) provides a 1 or 0 timeline: 1 when the uptime in the window was more than 60s, 0 otherwise. The bool modifier makes the comparison return 0 or 1 instead of filtering out series. Call this IS_UP_MORE_THAN_1M.
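avg_over_time(IS_UP_MORE_THAN_1M[$__range:]) * 100 gives the percentage of evaluation points in the given $__range where the uptime was more than 1m. Note the : in [$__range:]. It turns the expression into a subquery, which is required because the ..._over_time functions take a range vector and IS_UP_MORE_THAN_1M is an instant-vector expression.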
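To see how the four steps combine, here is a rough Python sketch that mimics them on a list of 0/1 probe results. It is only an approximation of what Prometheus does: it assumes evenly spaced samples, a 15s probe interval (not stated above), and a subquery step equal to that interval. The function name uptime_percentage is made up for illustration.

```python
from fractions import Fraction
from typing import List

SCRAPE_INTERVAL_S = 15  # assumed probe interval, not given in the question
WINDOW_S = 75           # inner range from the query: [75s]
THRESHOLD_S = 60        # comparison value from the query: > bool(60)


def uptime_percentage(samples: List[int]) -> float:
    """Approximate the dissected query for evenly spaced 0/1 samples."""
    per_window = WINDOW_S // SCRAPE_INTERVAL_S  # samples in one 75s window
    flags = []
    for end in range(1, len(samples) + 1):
        # Trailing window ending at this sample (shorter at the very start).
        window = samples[max(0, end - per_window):end]
        # UP_TIME_PERCENTAGE: exact fraction to avoid float rounding at 60.
        up_time_percentage = Fraction(sum(window), len(window))
        # UP_TIME_75S: uptime in seconds within the window.
        up_time_75s = up_time_percentage * WINDOW_S
        # IS_UP_MORE_THAN_1M: 1 or 0, like "> bool(60)".
        flags.append(1 if up_time_75s > THRESHOLD_S else 0)
    # Outer avg_over_time(...) * 100 over all evaluation points.
    return 100 * sum(flags) / len(flags)


print(uptime_percentage([1] * 20))                   # 100.0
print(uptime_percentage([1] * 10 + [0] + [1] * 9))   # 75.0
```

Under these assumptions, one failed probe out of 20 lowers the result to 75.0, because four of five samples give exactly 60s, which is not more than 60. So it is worth checking the 75s window and the 60 threshold against your real probe interval before relying on the number.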