
Difference between PromQL "by" and "without" unclear

I have a question about calculating response times with Prometheus summary metrics.

I created a summary metric that is labeled not only with the service name but also with the complete path and the HTTP method.

Now I am trying to calculate the average response time for the whole service. I read the article about "rate then sum", and either I do not understand how the calculation is done or the calculation is, IMHO, not correct.

As far as I have read, this should be the correct way to calculate the response time per second:

sum by(service_id) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
    /
    rate(request_duration_count{status_code=~"2.*"}[5m])
)

As I understand it, this computes the "duration per second" value (rate of sum / rate of count) for each series and then sums those values per service_id.

This looks absolutely wrong to me, but perhaps it does not work the way I understand it.

Another way to get a similar-looking result is this:

sum without (path,host) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
    /
    rate(request_duration_count{status_code=~"2.*"}[5m])
)
  • But what is the difference?
  • What is really happening here?
  • And why do I honestly only get measurable values if I use "max" instead of "sum"?

If I ignored everything I have read, I would try it the following way:

rate(sum by(service_id) request_duration_sum{status_code=~"2.*"}[5m])
/
rate(sum by(service_id) request_duration_count{status_code=~"2.*"}[5m])

But this does not work at all (instant vector vs. range vector and so on).

eventhorizon asked Jun 27 '18 14:06


1 Answer

All of these examples are aggregating incorrectly, as you're averaging an average. You want:

  sum without (path,host) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
  )
/
  sum without (path,host) (
    rate(request_duration_count{status_code=~"2.*"}[5m])
  )

This returns the average latency per status_code and any other remaining labels.
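To see why summing the per-series ratios goes wrong, and how "by" and "without" differ in the labels they keep, here is a small Python sketch of the arithmetic. It is not PromQL and not how Prometheus is implemented; the label values and rates are made up for illustration, and the helper names (group_key, sum_of_ratios, ratio_of_sums) are hypothetical.

```python
from collections import defaultdict

# Made-up per-second rates for three series of one service.
# Each entry: (labels, rate of request_duration_sum, rate of request_duration_count)
series = [
    ({"service_id": "shop", "path": "/a", "host": "h1", "status_code": "200"}, 0.5, 10.0),
    ({"service_id": "shop", "path": "/b", "host": "h1", "status_code": "200"}, 4.0, 2.0),
    ({"service_id": "shop", "path": "/b", "host": "h2", "status_code": "200"}, 1.5, 3.0),
]


def group_key(labels, by=None, without=None):
    """Labels left after aggregation: `by` keeps only the listed labels,
    `without` drops the listed labels and keeps all others."""
    if by is not None:
        kept = {k: v for k, v in labels.items() if k in by}
    else:
        kept = {k: v for k, v in labels.items() if k not in without}
    return tuple(sorted(kept.items()))


def sum_of_ratios(series, **grouping):
    """Mimics sum(...)(rate(sum) / rate(count)): adds up per-series averages."""
    out = defaultdict(float)
    for labels, duration_rate, count_rate in series:
        out[group_key(labels, **grouping)] += duration_rate / count_rate
    return dict(out)


def ratio_of_sums(series, **grouping):
    """Mimics sum(...)(rate(sum)) / sum(...)(rate(count)): a true weighted average."""
    durations = defaultdict(float)
    counts = defaultdict(float)
    for labels, duration_rate, count_rate in series:
        key = group_key(labels, **grouping)
        durations[key] += duration_rate
        counts[key] += count_rate
    return {key: durations[key] / counts[key] for key in durations}


# `by` keeps only service_id; `without` keeps service_id AND status_code.
print(sum_of_ratios(series, by=("service_id",)))          # about 2.55: not an average
print(ratio_of_sums(series, without=("path", "host")))    # about 0.4 seconds per request
```

With these numbers the per-series averages are 0.05, 2.0 and 0.5 seconds. Their sum (about 2.55) is larger than any single request took on average, while dividing the summed durations by the summed counts gives about 0.4 seconds, weighted by how many requests each series actually handled. The result keys also show the label difference: "by (service_id)" leaves only service_id, whereas "without (path, host)" leaves every other label, here service_id and status_code.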

brian-brazil answered Oct 22 '22 08:10