What use cases really make Prometheus's Summary metric type necessary/unique?

Question

For Prometheus metrics collection, as the title says, I could not find a use case that can only be handled by the Summary type; it seems that all of them can also be handled by the Histogram type.

Take average request duration as an example: there is no doubt this can be done with a Summary, but I can achieve the same effect with a Histogram, as below:

rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
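
To see what this ratio measures, here is a minimal Python sketch using made-up counter values for two scrapes (the numbers and the 60-second spacing are assumptions, not real data). Dividing the per-second increase of `_sum` by the per-second increase of `_count` gives the mean duration of the requests observed in the window. The same arithmetic applies to a Summary, since both types expose `_sum` and `_count`.

```python
# Two hypothetical scrapes of the same series, 60 seconds apart.
# Each tuple is (timestamp_seconds, counter_value); the values are invented.
sum_samples = [(0.0, 120.0), (60.0, 150.0)]    # http_request_duration_seconds_sum
count_samples = [(0.0, 400.0), (60.0, 500.0)]  # http_request_duration_seconds_count


def simple_rate(samples):
    """Per-second increase between the first and last sample.

    This ignores counter resets and the extrapolation that PromQL's
    rate() performs, so it only approximates rate().
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# 30 extra seconds of total duration spread over 100 extra requests.
avg_duration = simple_rate(sum_samples) / simple_rate(count_samples)
print(avg_duration)
```

With these numbers the result is about 0.3 seconds per request, which is an average and says nothing about the tail of the distribution.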

The only difference I can see is that a Summary computes its percentiles in the client: it consists of count and sum counters (as in the Histogram type) plus the resulting quantile values.

So I am a bit lost on what use cases really make the Summary type necessary/unique. Please help me understand.

brian-brazil · Accepted Answer

The Summary metric is not unique. Many other instrumentation systems offer something similar, such as Dropwizard's Histogram type (it's a histogram internally, but exposed as quantiles). This is one reason it exists: so that such types from other instrumentation systems can be mapped more cleanly.

Another reason it exists is historical. In Prometheus the Summary came before the Histogram, and the general recommendation is to use a Histogram, as it is aggregatable whereas the Summary's quantiles are not. On the other hand, histograms require you to pre-select buckets in order to be aggregatable and to allow analysis over arbitrary time frames.

There is a longer comparison of the two types in the docs.

valyala · Answer

The Prometheus summary metric type is useful when there is a set of pre-defined percentiles that must be exposed for some metric, such as request duration or response size, and there is no need to calculate aggregate percentiles over multiple metrics. For example, if you need to measure the 90th, 97th and 99th percentiles of request duration on a single server, then the following metrics, which make up a Prometheus summary, would be useful to export:

http_request_duration_seconds{quantile="0.99"}
http_request_duration_seconds{quantile="0.97"}
http_request_duration_seconds{quantile="0.90"}

Another common reason why users prefer Prometheus summary type over Prometheus histogram type is that summary metrics are easier to understand and to deal with.

The summary metric type has the following limitations compared to the histogram metric type:

Summary metric type doesn't allow calculating percentiles other than the pre-defined ones. For example, if a summary metric exposes only the 0.9 and 0.95 quantiles, then it is impossible to calculate the 0.99 or 0.5 quantile from the collected data.
Summary metric type doesn't allow calculating aggregate percentiles over multiple summary metrics. For example, if the http_request_duration_seconds{quantile="0.99"} metric is exposed individually by each server in a cluster, then it is impossible to calculate the 99th percentile of request duration over all the servers in the cluster. Users sometimes use avg(http_request_duration_seconds{quantile="0.99"}) or max(http_request_duration_seconds{quantile="0.99"}) as a workaround, but the resulting value may be far from the actual percentile.
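
The second limitation can be illustrated with a small Python sketch. The two servers, their request counts and their durations are invented for illustration, and the nearest-rank `quantile` helper is a hypothetical stand-in for whatever estimator a client library actually uses.

```python
import math


def quantile(values, q):
    """Nearest-rank quantile of a list of numbers, for 0 < q <= 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical traffic: a busy fast server and a nearly idle slow one.
server_a = [0.1] * 10_000  # 10,000 requests at 0.1 s each
server_b = [2.0] * 50      # 50 requests at 2.0 s each

# What each server would expose as its own quantile="0.99" series.
p99_a = quantile(server_a, 0.99)
p99_b = quantile(server_b, 0.99)

# The two common workarounds.
avg_p99 = (p99_a + p99_b) / 2
max_p99 = max(p99_a, p99_b)

# The value that is actually wanted: p99 over all requests in the cluster.
true_p99 = quantile(server_a + server_b, 0.99)

print(avg_p99, max_p99, true_p99)
```

Here the slow server contributes fewer than 1% of all requests, so the cluster-wide 99th percentile is still 0.1 s, while the avg() and max() workarounds report about 1.05 s and 2.0 s. The per-server quantiles carry no information about how many requests each server handled, which is why they cannot be combined correctly.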

The histogram metric type in Prometheus also has its own issues:

Low precision of calculated percentiles when the exported histogram buckets have insufficient coverage for the measurement. For example, if the http_request_duration_seconds histogram has the buckets [0-0.1], (0.1-1.0], (1.0-10.0] and the majority of requests take 0.5 seconds, then all these requests go to the (0.1-1.0] bucket, and it is impossible to calculate any percentile with good precision from such data.
Too many exported buckets. When users stumble upon the first issue, the most common reaction is to create a large number of buckets in order to have good coverage over the measurement. This may lead to high cardinality issues, since each bucket is exposed as a separate metric (aka time series).
Inability to aggregate histograms with distinct sets of buckets. For example, the http_request_duration_seconds histogram may have a distinct set of buckets for each monitored service. Then it is impossible to calculate a percentile for this histogram over multiple services.
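
The precision issue from the first item can be sketched in Python. The function below is a simplified approximation of the linear interpolation that PromQL's histogram_quantile() performs inside a bucket, not the actual implementation, and the observation counts are assumptions chosen to match the example above (1,000 requests that all took 0.5 seconds).

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket. Observations are assumed
    to be spread evenly inside each bucket (linear interpolation).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, count_below = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Falls into +Inf: report the highest finite bound instead.
                return lower_bound
            in_bucket = cumulative - count_below
            fraction = (rank - count_below) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, count_below = upper_bound, cumulative
    return math.nan


# Every one of the 1,000 requests took 0.5 s, so all land in (0.1, 1.0].
buckets = [(0.1, 0), (1.0, 1000), (10.0, 1000), (math.inf, 1000)]

p50 = histogram_quantile(0.50, buckets)
p99 = histogram_quantile(0.99, buckets)
print(p50, p99)
```

The estimates come out at roughly 0.55 s and 0.99 s even though every request took exactly 0.5 s: the histogram only knows that the observations fall somewhere between 0.1 and 1.0, so the interpolation spreads them across the whole bucket.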

These issues are solved by the VictoriaMetrics histogram type; see this article for details.

Tags:

prometheus

lnshi
