Understanding histogram_quantile based on rate in Prometheus

According to the Prometheus documentation, to get the 95th percentile from a histogram metric I can use the following query:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Source: https://prometheus.io/docs/practices/histograms/#quantiles

Since each bucket of a histogram is a counter, we can calculate the rate of each bucket, which the documentation defines as:

per-second average rate of increase of the time series in the range vector.

See: https://prometheus.io/docs/prometheus/latest/querying/functions/#rate

So, for instance, if bucket value[t-5m] = 100 and bucket value[t] = 200 then bucket rate[t] = (200-100)/(5*60) = 0.333
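The arithmetic for a 5-minute window can be sketched in Python as below. This is only the simple two-sample case described here; the helper name simple_rate is made up for illustration, and the real rate() additionally handles counter resets and extrapolates to the window edges.

```python
def simple_rate(first_value, last_value, window_seconds):
    """Per-second average increase of a counter between two samples.

    Hypothetical helper: ignores counter resets and the extrapolation
    that Prometheus's rate() performs.
    """
    return (last_value - first_value) / window_seconds


# Bucket counter went from 100 to 200 over a 5-minute (300 s) window.
bucket_rate = simple_rate(100, 200, 5 * 60)
print(round(bucket_rate, 3))  # 0.333
```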

And finally, the most confusing part: how can the histogram_quantile function find the 95th percentile for a given metric knowing only the bucket rates?

Is there any code or algorithm I can take a look at to better understand it?

evgeniy44 asked Mar 14 '19 12:03

3 Answers

A solid example will explain histogram_quantile well.

Assumptions:

  • ONLY ONE series for simplicity
  • 10 buckets for metric http_request_duration_seconds.

10ms, 50ms, 100ms, 200ms, 300ms, 500ms, 1s, 2s, 3s, 5s

  • Each bucket series (http_request_duration_seconds_bucket) is a COUNTER, for example:

    time     value    delta    rate
    t-10m    50       N/A      N/A
    t-5m     100      50       50 / (5*60)
    t        200      100      100 / (5*60)
    ...      ...      ...      ...

  • We have at least two scrapes of the series covering 5 minutes, so that rate() can calculate a value for each bucket

rate_xxx(t) = (value_xxx[t]-value_xxx[t-5m]) / (5*60) is the per-second rate over [t-5m, t]. The divisor is the same for every bucket and cancels out in the quantile calculation, so the rest of this example treats rate_xxx(t) as the quantity of items recorded in [t-5m, t].

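  • We are looking at 2 samples (value(t) and value(t-5m)) here.
  • About 10000 HTTP request durations (items) were recorded, that is,
    10000 ≈ rate_10ms(t) + rate_50ms(t) + rate_100ms(t) + ... + rate_5s(t) + rate_+Inf(t).
    (The values in the table below add up to 9980; the walk-through rounds the total to 10000.)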

    bucket (le)    range        rate_xxx(t)
    10ms           ~10ms        3000
    50ms           10~50ms      3000
    100ms          50~100ms     1500
    200ms          100~200ms    1000
    300ms          200~300ms    800
    500ms          300~500ms    400
    1s             500ms~1s     200
    2s             1~2s         40
    3s             2~3s         30
    5s             3~5s         5
    +Inf           5s~          5

Buckets are the essence of a histogram. We only need the 11 numbers in rate_xxx(t) (10 finite buckets plus +Inf) to do the quantile calculation.

Let's take a close look at this expression (aggregation like sum() is omitted for simplicity):

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

We are actually looking for the item at the 95% position among the items counted in rate_xxx(t), going from bucket=10ms to bucket=+Inf. With 10000 items in total, that is the 9500th item (10000 * 0.95).
From the table above, 9300 = 3000+3000+1500+1000+800 items fall below bucket=500ms.

So the 9500th item is the 200th item (9500-9300) in bucket=500ms (range=300~500ms), which holds 400 items.

Prometheus assumes that the items in a bucket are spread evenly, so it interpolates linearly within the bucket.
The metric value for the 200th item in bucket=500ms is 400ms = 300+(500-300)*(200/400)

That is, the 95th percentile is 400ms.
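The walk-through can be sketched in Python as follows. This is a minimal sketch modelled on the steps above, not the Prometheus source: the function and variable names are my own, bounds are in milliseconds, and the bucket counts are cumulative, as they are in real `le` series. As an assumption for this sketch, everything slower than 500ms is folded into +Inf so that the total is the 10000 used in the walk-through.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    Illustrative only: no handling of q outside [0, 1] or empty histograms.
    """
    total = buckets[-1][1]
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if index == len(buckets) - 1:
        # The rank falls into +Inf: fall back to the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[index]
    # The first bucket is assumed to start at 0.
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Cumulative counts built from the per-range values 3000, 3000, 1500, 1000, 800, 400.
# Assumption: the slower requests are lumped into +Inf to reach 10000 in total.
buckets_ms = [
    (10, 3000), (50, 6000), (100, 7500), (200, 8500),
    (300, 9300), (500, 9700), (math.inf, 10000),
]

p95_ms = histogram_quantile(0.95, buckets_ms)
print(round(p95_ms, 6))  # 400.0
```

Working with cumulative counts means the "items before this bucket" figure (9300 here) is simply the count of the previous bucket, so no running sum is needed.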

There are a few things to bear in mind:

  • The bucket series of a histogram metric are COUNTERs in nature
  • Series used for quantile calculation must always have the label le defined
  • Items (data) in a specific bucket (e.g.: 300~500ms) are assumed to be spread evenly, in a linear pattern; Prometheus makes at least this assumption
  • Quantile calculation requires the buckets to be sorted (defined) in ascending/descending order (e.g.: 1ms < 5ms < 10ms < ...)
  • The result of histogram_quantile is an approximation

P.S.:
The metric value is not always accurate, because items (data) in a specific bucket are assumed to be spread evenly in a linear pattern.

Say the max duration in reality (e.g.: from the nginx access log) in bucket=500ms (range=300~500ms) is 310ms. We will still get 400ms from histogram_quantile with the setup above, which is quite confusing sometimes.

The smaller the bucket distance, the more accurate the approximation.
So set up bucket boundaries that fit your needs.
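To make the effect of bucket width concrete, here is a small Python sketch. The observations are hypothetical (100 durations between 301ms and 310ms, not data from this answer), and estimate_quantile is an illustrative helper that buckets raw values and then interpolates linearly, in the spirit of histogram_quantile.

```python
import bisect


def estimate_quantile(q, bounds, observations):
    """Bucket raw observations by upper bound, then interpolate linearly.

    Illustrative helper; assumes every observation is <= bounds[-1].
    """
    counts = [0] * len(bounds)
    for value in observations:
        # Index of the first bound that is >= value (the `le` bucket).
        counts[bisect.bisect_left(bounds, value)] += 1
    rank = q * len(observations)
    seen = 0
    lower = 0.0
    for upper, count in zip(bounds, counts):
        if count > 0 and seen + count >= rank:
            return lower + (upper - lower) * (rank - seen) / count
        seen += count
        lower = upper
    return bounds[-1]


# Hypothetical data: every request took between 301ms and 310ms.
observations_ms = list(range(301, 311)) * 10

coarse = estimate_quantile(0.5, [300, 500], observations_ms)
fine = estimate_quantile(0.5, [300, 310, 500], observations_ms)
print(coarse)  # 400.0
print(fine)    # 305.0
```

With only the 300~500ms bucket, the estimated median lands in the middle of the bucket even though no request was slower than 310ms. Adding a boundary at 310ms brings the estimate back into the range where the data actually lies.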

Ace answered Oct 20 '22 01:10


I believe this is the code for it in Prometheus.
The general idea is that you use the data in the buckets to extrapolate / approximate the quantiles. Elasticsearch also does something similar (yet different/much simpler) in their rollup capabilities.

Elad Amit answered Oct 20 '22 02:10


You can refer to my reply here

Actually, the rate() function is just used to specify the time window; the denominator has no effect on the computation of the percentile value.
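A quick way to convince yourself is the Python sketch below. It is an illustrative linear-interpolation quantile, not the Prometheus implementation; the per-bucket increases are the finite-bucket values from the example in the first answer, with the +Inf bucket left out to keep the sketch short. Dividing every bucket by the same window length leaves the result unchanged.

```python
import math


def quantile_from_counts(q, bounds, counts):
    """Linear-interpolation quantile from per-bucket (non-cumulative) values."""
    rank = q * sum(counts)
    seen = 0.0
    lower = 0.0
    for upper, count in zip(bounds, counts):
        if count > 0 and seen + count >= rank:
            return lower + (upper - lower) * (rank - seen) / count
        seen += count
        lower = upper
    return bounds[-1]


bounds_ms = [10, 50, 100, 200, 300, 500, 1000, 2000, 3000, 5000]
increases = [3000, 3000, 1500, 1000, 800, 400, 200, 40, 30, 5]

# What rate(...[5m]) would give in the simple case: increase / 300 seconds.
rates = [increase / (5 * 60) for increase in increases]

from_increases = quantile_from_counts(0.95, bounds_ms, increases)
from_rates = quantile_from_counts(0.95, bounds_ms, rates)
print(math.isclose(from_increases, from_rates, rel_tol=1e-9))  # True
```

Both the rank and the bucket values are scaled by the same factor, so the ratio used for the interpolation inside the bucket stays the same.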

howardxking answered Oct 20 '22 02:10