According to the Prometheus documentation, to get the 95th percentile from a histogram metric I can use the following query:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Source: https://prometheus.io/docs/practices/histograms/#quantiles
Since each bucket of a histogram is a counter, we can calculate the rate of each bucket, which the documentation defines as:
the per-second average rate of increase of the time series in the range vector.
See: https://prometheus.io/docs/prometheus/latest/querying/functions/#rate
So, for instance, if bucket value[t-5m] = 100 and bucket value[t] = 200, then bucket rate[t] = (200-100)/(5*60) = 0.333
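The arithmetic can be checked with a minimal Python sketch. For a 5m range the window is 5*60 = 300 seconds, so two samples of 100 and 200 give roughly 0.333 per second. The function name `simple_rate` is made up for illustration, and the sketch assumes exactly two samples at the window edges with no counter reset; the real rate() also handles resets and extrapolates to the window boundaries, which is ignored here.

```python
def simple_rate(first_value: float, last_value: float, window_seconds: float) -> float:
    """Per-second average increase between two counter samples.

    Simplification: ignores counter resets and the boundary extrapolation
    that Prometheus's rate() performs.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (last_value - first_value) / window_seconds


# A bucket counter that went from 100 to 200 over a 5-minute window.
rate = simple_rate(100, 200, 5 * 60)
print(round(rate, 3))  # 0.333
```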
And finally, the most confusing part: how can the histogram_quantile function find the 95th percentile for a given metric knowing only the bucket rates?
Is there any code or algorithm I can look at to better understand it?
A solid example will explain histogram_quantile well.
Assumptions:
Buckets defined for http_request_duration_seconds: 10ms, 50ms, 100ms, 200ms, 300ms, 500ms, 1s, 2s, 3s, 5s (plus the implicit +Inf bucket)
Each bucket of http_request_duration_seconds is a COUNTER, for example:
time | value | delta | rate (quantity of items) |
---|---|---|---|
t-10m | 50 | N/A | N/A |
t-5m | 100 | 50 | 50 / (5*60) |
t | 200 | 100 | 100 / (5*60) |
... | ... | ... | ... |
rate() is used to calculate the quantity for each bucket: rate_xxx(t) = (value_xxx[t] - value_xxx[t-5m]) / (5*60) is the quantity of items for [t-5m, t].
Only two samples (value(t) and value(t-5m)) are considered here.
Roughly 10000 http request durations (items) were recorded, that is, rate_10ms(t) + rate_50ms(t) + rate_100ms(t) + ... + rate_5s(t) + rate_+Inf(t) is taken as a round 10000 (the figures in the table below add up to 9980).
bucket(le) | 10ms | 50ms | 100ms | 200ms | 300ms | 500ms | 1s | 2s | 3s | 5s | +Inf |
---|---|---|---|---|---|---|---|---|---|---|---|
range | ~10ms | 10~50ms | 50~100ms | 100~200ms | 200~300ms | 300~500ms | 500ms~1s | 1~2s | 2s~3s | 3~5s | 5s~ |
rate_xxx(t) | 3000 | 3000 | 1500 | 1000 | 800 | 400 | 200 | 40 | 30 | 5 | 5 |
Buckets are the essence of a histogram. We just need the 11 numbers in the rate_xxx(t) row to do the quantile calculation.
Let's take a close look at this expression (aggregation like sum() is omitted for simplicity):
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
We are actually looking for the item at the 95th percentile in rate_xxx(t), from bucket=10ms to bucket=+Inf. The 95th percentile means the 9500th item here, since we have 10000 items in total (10000 * 0.95).
From the table above, there are 9300 = 3000+3000+1500+1000+800 items in total before bucket=500ms.
So the 9500th item is the 200th item (9500-9300) in bucket=500ms (range=300~500ms), which holds 400 items.
And Prometheus assumes that the items in a bucket are spread evenly, in a linear pattern.
The metric value for the 200th item in bucket=500ms is 400ms = 300+(500-300)*(200/400).
That is, the 95th percentile is 400ms.
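The walk-through can be reproduced with a short Python sketch. It uses the per-range item counts from the table above (note that real `_bucket` series carry cumulative `le` counts, not per-range ones), the round total of 10000 from the text, and leaves out the +Inf range because it has no finite upper bound to interpolate toward. `RANGES` and `value_at_rank` are illustrative names, not anything Prometheus exposes.

```python
# Per-range item counts from the table above (NOT cumulative `le` counts).
# Each entry: (lower bound in ms, upper bound in ms, items in the range).
RANGES = [
    (0, 10, 3000), (10, 50, 3000), (50, 100, 1500), (100, 200, 1000),
    (200, 300, 800), (300, 500, 400), (500, 1000, 200), (1000, 2000, 40),
    (2000, 3000, 30), (3000, 5000, 5),
]


def value_at_rank(rank: float, ranges) -> float:
    """Walk the ranges until `rank` is reached, then interpolate linearly."""
    seen = 0  # items counted in all earlier ranges
    for lower, upper, count in ranges:
        if count > 0 and seen + count >= rank:
            return lower + (upper - lower) * (rank - seen) / count
        seen += count
    raise ValueError("rank falls beyond the last finite bucket")


rank = 9500  # 0.95 * 10000, the round total used in the text
print(value_at_rank(rank, RANGES))  # 400.0
```

The 9500th item lands in the 300~500ms range with 9300 items before it, so the result is 300 + 200 * 200/400 = 400ms.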
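There are a few things to bear in mind:
The metric must be COUNTER in nature, as the histogram metric type is.
The series used for the quantile calculation must have the label le defined.
Items in a bucket are assumed to be spread evenly in a linear pattern. Prometheus makes this assumption at least.
The result of histogram_quantile is an approximation.
P.S.: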
The metric value is not always accurate, due to the assumption that items (data) in a specific bucket are spread evenly in a linear pattern.
Say the max duration in reality (e.g. from the nginx access log) in bucket=500ms (range=300~500ms) is 310ms; we will still get 400ms from histogram_quantile with the setup above, which is quite confusing sometimes.
The smaller the bucket distance is, the more accurate the approximation is.
So set up bucket distances that fit your needs.
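To see how much the bucket layout matters, here is a small Python sketch with made-up data: 100 requests that all took 305ms or 310ms, bucketed once with a wide 300~500ms bucket and once with an extra boundary at 320ms. `estimate_quantile` is a hypothetical helper that assumes 0 < q <= 1, a non-empty list of observations, and that every observation is at or below the largest bound.

```python
from bisect import bisect_left


def estimate_quantile(q: float, observations, upper_bounds) -> float:
    """Bucket the observations, then interpolate linearly inside one bucket."""
    bounds = sorted(upper_bounds)
    # Cumulative count per bucket: observations with value <= upper bound.
    cumulative = [sum(1 for v in observations if v <= b) for b in bounds]
    rank = q * cumulative[-1]
    i = bisect_left(cumulative, rank)  # first bucket whose count reaches rank
    start = bounds[i - 1] if i > 0 else 0.0
    below = cumulative[i - 1] if i > 0 else 0
    return start + (bounds[i] - start) * (rank - below) / (cumulative[i] - below)


# Hypothetical durations in ms; the real 95th percentile is 310.
observations = [305.0] * 50 + [310.0] * 50

coarse = estimate_quantile(0.95, observations, [100, 300, 500])
fine = estimate_quantile(0.95, observations, [100, 300, 320, 500])
print(coarse, fine)  # about 490 with the wide bucket, about 319 with the narrow one
```

With the wide bucket the estimate is off by about 180ms; adding the 320ms boundary brings the error down to about 9ms.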
I believe this is the code for it in Prometheus.
The general idea is that you use the data in the buckets to extrapolate / approximate the quantiles
Elasticsearch also does something similar (yet different/much simpler) in their rollup capabilities
You can refer to my reply here
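Actually, the rate() function only serves to select the time window; its denominator has no effect on the computed percentile value, because every bucket is divided by the same window length and the quantile depends only on the ratios between buckets.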
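This can be checked with a simplified Python sketch that works on cumulative `le` buckets, which is how the `_bucket` series are actually stored. The bucket values are hypothetical, `bucket_quantile` is an illustrative name, and this is not a transcription of Prometheus's code: it assumes 0 < q <= 1, buckets sorted by upper bound with +Inf last, and a lower bound of 0 for the first bucket.

```python
import math


def bucket_quantile(q: float, buckets) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) pairs."""
    total = buckets[-1][1]
    if total <= 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate toward +Inf
            width = bound - prev_bound
            return prev_bound + width * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative increases over a 5m window, upper bounds in seconds.
increases = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
# The same buckets as per-second rates: every count divided by 300 seconds.
rates = [(bound, count / 300) for bound, count in increases]

from_increase = bucket_quantile(0.95, increases)
from_rate = bucket_quantile(0.95, rates)
print(from_increase, from_rate)  # both are about 0.75
```

Dividing every count by the same 300 seconds scales the rank and the bucket counts alike, so the interpolation ratio, and with it the result, stays the same.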