What use cases really make Prometheus's Summary metric type necessary/unique?

Tags:

prometheus

For Prometheus metrics collection, as the title says, I could not really find a use case that can only be handled by the Summary type; it seems they can all somehow be handled by the Histogram type as well.

Let's take the average request duration as an example: no doubt this can be done perfectly with the Summary type, but I can also achieve the same effect with the Histogram type, as below:

rate(http_request_duration_seconds_sum[1m]) / rate(http_request_duration_seconds_count[1m])

The only difference I can see is that for a Summary the percentiles are computed on the client side: it consists of count and sum counters (like the Histogram type) plus the resulting quantile values.
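
And for percentiles, I can get them from a Histogram at query time with something like the following (the 5-minute rate window is just an illustrative choice):

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))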

So I am a bit lost on what use cases really make the Summary type necessary/unique; please help enlighten me.

asked Jul 03 '18 by lnshi

2 Answers

The Summary metric is not unique; many other instrumentation systems offer something similar, such as Dropwizard's Histogram type (it's a histogram internally, but exposed as quantiles). This is one reason the Summary exists: so that such types from other instrumentation systems can be mapped over more cleanly.

Another reason it exists is historical: in Prometheus the Summary came before the Histogram. The general recommendation is to use a Histogram, as it is aggregatable (whereas the Summary's quantiles are not) and allows analysis over arbitrary time frames; the trade-off is that you have to pre-select the buckets.
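
As a sketch of what "aggregatable" means in practice (assuming a histogram named http_request_duration_seconds and an arbitrary 5-minute rate window), the bucket counters from all instances can be summed before the quantile is computed:

histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

There is no equivalent way to combine the pre-computed quantiles of Summaries from several instances into a correct overall quantile.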

There is a longer comparison of the two types in the docs.

answered Nov 15 '22 by brian-brazil


The Prometheus summary metric type is useful when there is a set of pre-defined percentiles that must be exposed for some metric, such as request duration or response size, and there is no need to calculate aggregate percentiles over multiple metrics. For example, if you need to measure the 90th, 97th and 99th percentiles of request duration on a single server, then the following metrics, which compose a Prometheus summary, would be useful to export:

http_request_duration_seconds{quantile="0.99"}
http_request_duration_seconds{quantile="0.97"}
http_request_duration_seconds{quantile="0.90"}

Another common reason why users prefer the Prometheus summary type over the histogram type is that summary metrics are easier to understand and deal with.
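
For example, graphing or alerting on the 99th percentile from a summary is a plain series selector, whereas a histogram needs a histogram_quantile() expression over its buckets (the 5-minute window below is an arbitrary choice):

http_request_duration_seconds{quantile="0.99"}

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))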

The summary metric type has the following limitations compared to the histogram metric type:

  • The summary metric type doesn't allow calculating percentiles other than the pre-defined ones. For example, if a summary metric exposes only the 0.9 and 0.95 quantiles, then it is impossible to calculate the 0.99 or 0.5 quantile from the collected data.
  • The summary metric type doesn't allow calculating aggregate percentiles over multiple summary metrics. For example, if the http_request_duration_seconds{quantile="0.99"} metric is exposed individually by each server in a cluster, then it is impossible to calculate the 99th percentile of request duration over all the servers in the cluster. Users sometimes use avg(http_request_duration_seconds{quantile="0.99"}) or max(http_request_duration_seconds{quantile="0.99"}) as a workaround, but the resulting value may be far from the actual percentile.

The histogram metric type in Prometheus also has its own issues:

  • Too low precision for the calculated percentiles when the exported histogram buckets don't cover the measured values well. For example, if the http_request_duration_seconds histogram has the buckets [0-0.1], (0.1-1.0], (1.0-10.0] and the majority of requests are executed in 0.5 seconds, then all these requests fall into the (0.1-1.0] bucket, and it is impossible to calculate any percentile with good precision from such data (see the sketch after this list).

  • Too big a number of exported buckets. When users run into the first issue, the most common reaction is to create a big number of buckets in order to get good coverage of the measured values. This may lead to high-cardinality issues, since each bucket is exposed as a separate metric (i.e. a separate time series).

  • Inability to aggregate histograms with distinct sets of buckets. For example, the http_request_duration_seconds histogram may have a distinct set of buckets for each monitored service, in which case it is impossible to calculate a percentile for this histogram across multiple services.
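
As a rough illustration of the first two issues (the bucket bounds come from the example above, the counts are made up): if almost all of 10000 requests take around 0.5 seconds, the exported series could look like this:

http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="1.0"} 10000
http_request_duration_seconds_bucket{le="10.0"} 10000
http_request_duration_seconds_bucket{le="+Inf"} 10000
http_request_duration_seconds_count 10000

histogram_quantile(0.99, ...) over these buckets returns roughly 0.99 seconds, because it can only interpolate linearly inside the (0.1, 1.0] bucket, even though the real 99th percentile is close to 0.5 seconds. And every extra bucket added to improve the precision becomes one more exported time series.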

These issues are solved by the VictoriaMetrics histogram type - see this article for details.

answered Nov 15 '22 by valyala