Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I get consistent values with influxdb non_negative_derivative?

Using grafana with influxdb, I am trying to show the per-second rate of some value that is a counter. If I use the non_negative_derivative(1s) function, the value of the rate seems to change dramatically depending on the time width of the grafana view. I'm using the last selector (but could also use max which is the same value since it is a counter).

Specifically, I'm using:

SELECT non_negative_derivative(last("my_counter"), 1s) FROM ...

According to the influxdb docs non-negative-derivative:

InfluxDB calculates the difference between chronological field values and converts those results into the rate of change per unit.

So to me, that means that the value at a given point should not change that much when expanding the time view, since the value should be rate of change per unit (1s in my example query above).

In graphite, they have the specific perSecond function, which works much better:

perSecond(consolidateBy(my_counter, 'max'))

Any ideas on what I'm doing wrong with the influx query above?

like image 711
Ben Avatar asked Jun 24 '16 14:06

Ben


3 Answers

If you want per second results that don't vary, you'll want to GROUP BY time(1s). This will give you accurate perSecond results.

Consider the following example:

Suppose that the value of the counter at each second changes like so

0s → 1s → 2s → 3s → 4s
1  → 2  → 5  → 8  → 11

Depending on how we group the sequence above, we'll see different results.

Consider the case where we group things into 2s buckets.

 0s-2s   →    2s-4s
(5-1)/2  →  (11-5)/2
   2     →      3

versus the 1s buckets

 0s-1s  →  1s-2s  →  2s-3s  →  3s-4s
(2-1)/1 → (5-2)/1 → (8-5)/1 → (11-8)/1
   1    →    3    →    3    →    3

Addressing

So to me, that means that the value at a given point should not change that much when expanding the time view, since the value should be rate of change per unit (1s in my example query above).

The rate of change per unit is a normalizing factor, independent of the GROUP BY time unit. Interpreting our previous example when we change the derivative interval to 2s may offer some insight.

The exact equation is

∆y/(∆x/tu)

Consider the case where we group things into 1s buckets with a derivative interval of 2s. The result we should see is

 0s-1s    →  1s-2s    →  2s-3s    →  3s-4s
2*(2-1)/1 → 2*(5-2)/1 → 2*(8-5)/1 → (11-8)/1
   2      →    6      →    6      →    6

This may seem a bit odd, but if you consider what this says it should make sense. When we specify a derivative interval of 2s what we're asking for is what the 2s rate of change is for the 1s GROUP BY bucket.

If we apply similar reasoning to the case of 2s buckets with a derivative interval of 2s is then

 0s-2s     →    2s-4s
2*(5-1)/2  →  2*(11-5)/2
   4       →      6

What we're asking for here is what the 2s rate of change is for the 2s GROUP BY bucket and in the first interval the 2s rate of change would be 4 and the second interval the 2s rate of change would be 6.

like image 135
Michael Desa Avatar answered Nov 20 '22 19:11

Michael Desa


@Michael-Desa gives an excellent explanation.

I'd like to augment that answer with a solution to a pretty common metric our company is interested in: "What is the maximum "operation per second" value on a specific measurement field?".

I will use a real-life example from our company.

Scenario Background

We send a lot of data from an RDBMS to redis. When transferring that data, we keep track of 5 counters:

  1. TipTrgUp -> Updates by a business trigger (stored procedure)
  2. TipTrgRm -> Removes by a business trigger (stored procedure)
  3. TipRprUp -> Updates by an unattended auto-repair batch process
  4. TipRprRm -> Removes by an unattended auto-repair batch process
  5. TipDmpUp -> Updates by a bulk-dump process

We made a metrics collector that sends the current state of these counters to InfluxDB, with an interval of 1 second (configurable).

Grafana graph 1: low resolution, no true max ops

Here is the grafana query that is useful, but does not show the true max ops when zoomed out (we know it will go to around 500 ops on a normal business day, when no special dumps or maintenance is taking place - otherwise it goes into the thousands):

SELECT
    non_negative_derivative(max(TipTrgUp),1s) AS "update/TipTrgUp"
   ,non_negative_derivative(max(TipTrgRm),1s) AS "remove/TipTrgRm"
   ,non_negative_derivative(max(TipRprUp),1s) AS "autorepair-up/TipRprUp"
   ,non_negative_derivative(max(TipRprRm),1s) AS "autorepair-rm/TipRprRm"
   ,non_negative_derivative(max(TipDmpUp),1s) AS "dump/TipDmpUp"
FROM "$rp"."redis_flux_-transid-d-s"
WHERE
    host =~ /$server$/
    AND $timeFilter
GROUP BY time($interval),* fill(null)

Sidenotes: $rp is the name of the retention policy, templated in grafana. We use CQ's to downsample to retention policies with a larger duration. Also note the 1s as a derivative parameter: it is needed, since the default is different when using GROUP BY. This can be easily overlooked in the InfluxDB documentation.

The graph, seen by 24 hours looks like this: enter image description here

If we simply use a resolution of 1s (as suggested by @Michael-Desa), an enormous amount of data is transferred from influxdb to the client. It works reasonably well (about 10 seconds), but too slow for us.

Grafana graph 2: low and high resolution, true max ops, slow performance

We can however use subqueries to add the true maxops to this graph, which is a slight improvement. A lot less data is transferred to the client, but the InfluxDB server has to do a lot of number crunching. Series B (with maxops prepended in the aliases):

SELECT
    max(subTipTrgUp) AS maxopsTipTrgUp
   ,max(subTipTrgRm) AS maxopsTipTrgRm
   ,max(subTipRprUp) AS maxopsRprUp
   ,max(subTipRprRm) AS maxopsTipRprRm
   ,max(subTipDmpUp) AS maxopsTipDmpUp
FROM (
    SELECT
        non_negative_derivative(max(TipTrgUp),1s) AS subTipTrgUp
       ,non_negative_derivative(max(TipTrgRm),1s) AS subTipTrgRm
       ,non_negative_derivative(max(TipRprUp),1s) AS subTipRprUp
       ,non_negative_derivative(max(TipRprRm),1s) AS subTipRprRm
       ,non_negative_derivative(max(TipDmpUp),1s) AS subTipDmpUp
    FROM "$rp"."redis_flux_-transid-d-s"
    WHERE
        host =~ /$server$/
        AND $timeFilter
    GROUP BY time(1s),* fill(null)
)
WHERE $timeFilter
GROUP BY time($interval),* fill(null)

Gives: enter image description here

Grafana graph 3: low and high resolution, true max ops, high performance, pre-calculate by CQ

Our final solution to these kind of metrics (but only when we need a live view, the subquery approach works fine for ad-hoc graphs) is: use a Continuous Query to pre-calculate the true maxops. We generate CQ's like this:

CREATE CONTINUOUS QUERY "redis_flux_-transid-d-s.maxops.1s"
ON telegraf
BEGIN
    SELECT
        non_negative_derivative(max(TipTrgUp),1s) AS TipTrgUp
       ,non_negative_derivative(max(TipTrgRm),1s) AS TipTrgRm
       ,non_negative_derivative(max(TipRprUp),1s) AS TipRprUp
       ,non_negative_derivative(max(TipRprRm),1s) AS TipRprRm
       ,non_negative_derivative(max(TipDmpUp),1s) AS TipDmpUp
    INTO telegraf.A."redis_flux_-transid-d-s.maxops"
    FROM telegraf.A."redis_flux_-transid-d-s"
    GROUP BY time(1s),*
END

From here on, it's trivial to use these maxops measurements in grafana. When downsampling to an RP with longer retention, we again use max() as the selector function.

Series B (with .maxops appended in the aliases)

SELECT
    max(TipTrgUp) AS "update/TipTrgUp.maxops"
   ,max(TipTrgRm) AS "remove/TipTrgRm.maxops"
   ,max(TipRprUp) as "autorepair-up/TipRprUp.maxops"
   ,max(TipRprRm) as "autorepair-rm/TipRprRm.maxops"
   ,max(TipDmpUp) as "dump/TipDmpUp.maxops"
FROM "$rp"."redis_flux_-transid-d-s.maxops"
WHERE
    host =~ /$server$/
    AND $timeFilter
GROUP BY time($interval),* fill(null)

Gives: enter image description here

When zoomed in to 1s precision, you can see that the graphs become identical: enter image description here

Hope this helps, TW

like image 35
Tw Bert Avatar answered Nov 20 '22 20:11

Tw Bert


The problem here is that the $__interval width changes depending on the time frame you are viewing in Grafana.

The way then to get consistent results is to take a sample from each interval (mean(), median(), or max() all work equally well) and then transform by derivative($__interval). That way your derivative changes to match your interval length as you zoom in/out.

So, your query might look like:

SELECT derivative(mean("mem.gc.count"), $__interval) FROM "influxdb"
WHERE $timeFilter GROUP BY time($__interval) fill(null)
like image 2
Dave Avatar answered Nov 20 '22 20:11

Dave