I have a question about calculating response times with Prometheus summary metrics.
I created a summary metric that is labelled not only with the service name but also with the complete path and the HTTP method.
Now I am trying to calculate the average response time for the complete service. I read the article about "rate then sum", and either I do not understand how the calculation is done or the calculation is IMHO not correct.
As far as I have read, this should be the correct way to calculate the average response time:
sum by(service_id) (
rate(request_duration_sum{status_code=~"2.*"}[5m])
/
rate(request_duration_count{status_code=~"2.*"}[5m])
)
As I understand it, this computes the average duration (rate of sum / rate of count) for each subset and then sums those values per service_id.
This looks absolutely wrong to me, but maybe it does not work the way I understand it.
Another way to get a similar-looking result is this:
sum without (path,host) (
rate(request_duration_sum{status_code=~"2.*"}[5m])
/
rate(request_duration_count{status_code=~"2.*"}[5m])
)
If I ignored everything I have read, I would try it the following way:
rate(sum by(service_id) request_duration_sum{status_code=~"2.*"}[5m])
/
rate(sum by(service_id) request_duration_count{status_code=~"2.*"}[5m])
But this does not work at all (instant vector vs. range vector and so on).
Aggregation operators: Prometheus supports built-in aggregation operators that aggregate the elements of a single instant vector into a new vector of fewer elements with aggregated values, including sum (calculate sum over dimensions) and min (select minimum over dimensions).
rate() : This calculates the rate of increase per second, averaged over the entire provided time window. Example: rate(http_requests_total[5m]) yields the per-second rate of HTTP requests as averaged over a time window of 5 minutes.
Prometheus provides a functional query language called PromQL (Prometheus Query Language) that lets the user select and aggregate time series data in real time.
So the end result of rate(http_requests_total[5m]) is the average number of requests per second over the last 5 minutes, calculated individually for each time series named http_requests_total.
All of these examples are aggregating incorrectly, as you're averaging an average. You want to sum the numerator and the denominator separately and only then divide:
sum without (path,host) (
rate(request_duration_sum{status_code=~"2.*"}[5m])
)
/
sum without (path,host) (
rate(request_duration_count{status_code=~"2.*"}[5m])
)
This returns the average latency per status_code, plus any other remaining labels.
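To see why the order of division and aggregation matters, here is a small arithmetic sketch in Python. The paths and numbers are made up for illustration: each entry stands for the 5m rate of request_duration_sum and request_duration_count for one path of the same service.

```python
# Hypothetical per-path rates over the same window (illustrative numbers only).
# Each value is (rate of request_duration_sum, rate of request_duration_count).
series = {
    "/fast": (9.0, 90.0),   # 90 requests/s, 0.1 s average latency
    "/slow": (20.0, 10.0),  # 10 requests/s, 2.0 s average latency
}


def sum_of_ratios(rates):
    """Divide per series first, then sum: mirrors sum(rate(sum) / rate(count))."""
    return sum(s / c for s, c in rates.values())


def ratio_of_sums(rates):
    """Sum numerator and denominator first, then divide:
    mirrors sum(rate(sum)) / sum(rate(count))."""
    total_duration = sum(s for s, _ in rates.values())
    total_requests = sum(c for _, c in rates.values())
    return total_duration / total_requests


print("sum of per-path averages:", sum_of_ratios(series))
print("traffic-weighted average:", ratio_of_sums(series))
```

With these numbers the first form adds 0.1 s and 2.0 s together, which is not a latency any request actually saw, while the second form weights each path by its request rate and gives the average latency across all 100 requests per second.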