 

Prometheus: how to rate a sum of the same counter from different machines?

I have a Prometheus counter and want its rate over a time range (the real goal is to sum the rates, and sometimes to apply histogram_quantile to the result for a histogram metric).
However, several machines run this kind of job, and each one sets its own instance label. As a result, inc operations on this counter on different machines create different counter entities, because each combination of label values is unique.
The problem is that rate() works separately on each such counter entity.
The result is that counter entities with unique label combinations are not taken into account by rate().
For example, if I've got:

mycounter{aaa="1",instance="1.2.3.4:6666",job="job1"} value: 1
mycounter{aaa="2",instance="1.2.3.4:6666",job="job1"} value: 1
mycounter{aaa="2",instance="1.2.3.4:7777",job="job1"} value: 1
mycounter{aaa="1",instance="5.5.5.5:6666",job="job1"} value: 1

All counter entities are unique, so each has a value of 1.
If the label combinations are always unique (because they come from different machines), rate(mycounter[5m]) returns 0 for every entity in this case, and sum(rate(mycounter[5m])) returns 0, which is not what I need!
I want to ignore the instance label, so that these mycounter inc operations are treated as if they were made on the same counter entity.
In other words, I expect to have only 2 entities (they can have a common instance value or no instance label):

mycounter{aaa="1", job="job1"} value: 2
mycounter{aaa="2", job="job1"} value: 2

In that case, an inc operation on a new machine (with an existing aaa value) would increase an existing counter entity instead of adding a new entity with a value of 1, and rate() would return the real rate for each entity, so we could sum() them. How do I do that?

I tried several approaches, but all of them failed:

  • Doing a rate() of the sum(): this fails because of a type mismatch...
  • Removing the automatic instance label using metric_relabel_configs with action: labeldrop in the configuration: Prometheus then assigns the default address value.
  • Changing all instance values to a common one using metric_relabel_configs with a replacement: it seems that one of the entities overwrites all the others, so it doesn't help...

Any suggestions?

Prometheus version: 2.3.2
Thanks in advance!

Amir asked Oct 08 '18 07:10




1 Answer

If the other labels (aaa, etc.) have a limited set of possible combinations, it is better to expose your counters at 0 on application start. That way rate() works correctly on each individual series, and sum() gives you correct results.
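
To see why starting at 0 matters, here is a minimal Python sketch of a simplified increase calculation: it adds up the deltas between consecutive scraped values and treats a drop as a counter reset. This is an illustration only, not how Prometheus is implemented. The real rate() also extrapolates to the window boundaries and divides by the window length, and the name simple_increase and the sample values are made up for this example.

```python
def simple_increase(samples):
    """Simplified counter increase over a window of scraped values.

    Adds up the deltas between consecutive samples. A drop is treated as a
    counter reset, so the post-reset value is counted as the increase.
    Unlike Prometheus, this does no extrapolation to the window edges.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Hypothetical series that first appears with value 1: the increment
# happened before the first scrape, so there is nothing to compare against.
first_seen_at_one = [1, 1, 1, 1]

# Same counter, but exposed at 0 on startup, so the increment is visible.
exposed_at_zero = [0, 1, 1, 1]

print(simple_increase(first_seen_at_one))  # 0.0
print(simple_increase(exposed_at_zero))    # 1.0
```

A series that is only ever seen at 1 contributes nothing, which matches the all-zero result described in the question. Once the 0 sample is scraped first, the step from 0 to 1 is counted.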

If you have to do a rate() of the sum(), read this first:

Note that when combining rate() with an aggregation operator (e.g. sum()) or a function aggregating over time (any function ending in _over_time), always take a rate() first, then aggregate. Otherwise rate() cannot detect counter resets when your target restarts.
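
The effect described in that note can be sketched with a small Python simulation. It uses a simplified increase (deltas between consecutive samples, with any drop treated as a counter reset) and skips the extrapolation that the real rate() performs. The instance names, sample values, and the helper simple_increase are assumptions made for this illustration.

```python
def simple_increase(samples):
    """Sum of deltas between consecutive samples; a drop counts as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Two hypothetical instances scraped at the same four points in time.
# Instance b restarts mid-window, so its counter resets to 0.
instance_a = [100, 110, 120, 130]
instance_b = [200, 210, 0, 10]

# Increase per series first, then aggregate: each reset is handled
# within its own series.
per_series_then_sum = simple_increase(instance_a) + simple_increase(instance_b)

# Aggregate first, then take the increase: the drop in the summed series
# looks like a reset of the whole sum, so the remaining total is counted again.
summed = [a + b for a, b in zip(instance_a, instance_b)]
sum_then_increase = simple_increase(summed)

print(per_series_then_sum)  # 50.0
print(sum_then_increase)    # 160.0
```

In this sketch the sum-first result overcounts, because instance a's still-running total of 120 is counted as new increase at the moment instance b restarts.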

If you can tolerate this (or if all instances reset their counters at the same time), there is a workaround. Define a recording rule as

record: job:mycounter:sum
expr: sum without(instance) (mycounter)

and then this expression works:

sum(rate(job:mycounter:sum[5m]))
qingbo answered Oct 04 '22 12:10