How to get Graphite to simply count counters, not time-rate them

I'm using Graphite and Collectd to monitor my server. In particular, I'm using the tail plugin to count failed SSH logins. I'm using a counter for this metric, so I expect to see 1, 2, 3, 0, etc. for data points. However, what I'm seeing is 0.1, 0.2, 0.3, 0, etc. It seems to me that Graphite is reporting counts per second. I say this because my retention policy is one data point every 10 seconds for two hours, so 1 failed login per 10 seconds = 0.1 per second. I'm looking at this in a graph. It looks like this:

Image

Furthermore, when I zoom out to the next retention level, the numbers are adjusted accordingly: 1 failed login, which was shown as 0.1, is now shown as something much smaller, roughly 0.017.
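
The arithmetic behind those two numbers can be sketched as follows. This is only an illustration of the reasoning above, assuming the six 10-second points are averaged into one 1-minute point; the function and variable names are made up for the example.

```python
from statistics import mean

INTERVAL_SECONDS = 10  # finest retention level in storage-schemas.conf (10s:2h)


def per_second_rate(events_in_interval, interval_seconds=INTERVAL_SECONDS):
    """Convert a count of events in one interval into a per-second rate."""
    return events_in_interval / interval_seconds


# One failed login in the first 10s slot, then nothing for the rest of the minute.
fine_points = [per_second_rate(n) for n in (1, 0, 0, 0, 0, 0)]

# Assumption: the next retention level (1m) stores the average of six 10s points.
coarse_point = mean(fine_points)

print(fine_points[0])          # 0.1
print(round(coarse_point, 3))  # 0.017
```

So both values are consistent with the same single login being expressed as a rate over a longer window.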

I don't think this is related to the aggregation method used: even the finest data is off. How can I get Graphite to treat this metric as a pure, raw, counter?

Here's my storage-schemas.conf (the retention policy):

[my_server]
pattern = .*
retentions = 10s:2h,1m:2d,30m:400d

Here's my configuration of the collectd tail plugin:

<Plugin "tail">
    <File "/var/log/auth.log">
            Instance "auth"
            <Match>
                    Regex "sshd[^:]*: Failed password"
                    DSType "CounterInc"
                    Type "counter"
                    Instance "sshd-invalid_user"
            </Match>
    </File>
</Plugin>

And here's my configuration of the write_graphite plugin (which sends data to Graphite):

<Plugin write_graphite>
    <Node "my_server_name">
            Host "localhost"
            Port "2003"
            Protocol "tcp"
            LogSendErrors true
            Prefix "collectd."
            #Postfix ""
            StoreRates true
            AlwaysAppendDS false
            EscapeCharacter "_"
    </Node>
</Plugin>

I tried setting StoreRates false for the write_graphite plugin, but this didn't work. It did change the behaviour: when I performed a single failed SSH login, the metric showed as 1. However, it didn't drop back down to 0. When I performed two more failed logins, the metric went up to 3.

Also of interest: I've loaded the users plugin, which simply shows the number of users logged in, and it's working great. It shows 1 when I SSH in, 2 when I SSH in again, and goes back to 1 when I exit one SSH session. This holds for both settings of StoreRates. So it seems like what I want is possible somehow, though maybe not with the tail plugin.

The SSH logins with StoreRates false along with correct behaviour for Users Logged in can be seen in these graphs:

Image

Any ideas? Thanks,

Cameron Lee asked Aug 17 '14 05:08


3 Answers

You are asking the system to count the number of events. And this is exactly what it's doing: it's counting the number of failed logins since its startup. Whether you're using StoreRates or not simply changes the way that information is displayed: as a rate or as the raw counter. A counter may never decrease! What you're actually asking for is a counter that resets itself upon reading: count the number of failed logins since the last time collectd checked.
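
To make the distinction concrete, here is a minimal sketch of the two behaviours. It is not how collectd is implemented; the class and function names are hypothetical and only model a counter that accumulates versus one that resets on every read.

```python
class CumulativeCounter:
    """Only ever goes up: reports the total number of events seen so far."""

    def __init__(self):
        self.total = 0

    def record_event(self):
        self.total += 1

    def read(self):
        return self.total


class ResetOnReadCounter:
    """Reports the events seen since the previous read, then starts over."""

    def __init__(self):
        self.pending = 0

    def record_event(self):
        self.pending += 1

    def read(self):
        value, self.pending = self.pending, 0
        return value


def simulate(counter, events_per_interval):
    """Feed events to the counter and read it once at the end of each interval."""
    readings = []
    for event_count in events_per_interval:
        for _ in range(event_count):
            counter.record_event()
        readings.append(counter.read())
    return readings


# One failed login, then a quiet interval, then two more failed logins.
events = [1, 0, 2]
print(simulate(CumulativeCounter(), events))   # [1, 1, 3]
print(simulate(ResetOnReadCounter(), events))  # [1, 0, 2]
```

The first list is what you observed with StoreRates false (1, then still 1, then 3); the second is what you were hoping to see.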

As it happens the ABSOLUTE data source type in rrdtool can be used to achieve this, but that won't help you.

Step back, and think about what you're trying to achieve: the number of failed logins per second seems to me like a perfectly sane metric!

faxmodem answered Sep 21 '22 01:09


Although swissunix's answer is very helpful, to achieve the behaviour I was looking for, I ended up using Logster instead of Collectd. With Logster, you write the bit of code that parses the file as well as the bit that returns the metric. So although dividing a count by the time is common with Logster, you don't have to do this if you don't want to: there's lots of flexibility.

I've put my parsers here: https://github.com/camlee/logster-parsers
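
For readers who haven't seen Logster, the shape of such a parser is roughly the following. This is a standalone sketch, not the code from the repository above: the class name, the metric name, and the sample log lines are made up, and it deliberately does not import Logster. A real parser would, as far as I know, subclass Logster's `LogsterParser` and return `MetricObject` instances from `get_state`. The regex is the one from the tail plugin configuration in the question.

```python
import re


class FailedSshLoginCounter:
    """Hypothetical parser: counts failed SSH logins per run, not per second."""

    FAILED_LOGIN = re.compile(r"sshd[^:]*: Failed password")

    def __init__(self):
        self.failed_logins = 0

    def parse_line(self, line):
        """Called once for every new line in the log file."""
        if self.FAILED_LOGIN.search(line):
            self.failed_logins += 1

    def get_state(self, duration):
        """Return the raw count since the last run; `duration` is ignored on
        purpose, because we do not want to divide the count by elapsed time."""
        count, self.failed_logins = self.failed_logins, 0
        return {"failed_ssh_logins": count}


# Illustrative log lines, not taken from a real auth.log.
sample_lines = [
    "Aug 17 05:01:02 host sshd[1234]: Failed password for invalid user bob",
    "Aug 17 05:01:05 host sshd[1234]: Accepted password for alice",
    "Aug 17 05:01:09 host sshd[1240]: Failed password for root",
]

parser = FailedSshLoginCounter()
for log_line in sample_lines:
    parser.parse_line(log_line)

first_run = parser.get_state(duration=10)
second_run = parser.get_state(duration=10)
print(first_run, second_run)
```

Because the count is reset in `get_state`, a run with no new failed logins reports 0, which is the behaviour I was after.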

Cameron Lee answered Sep 22 '22 01:09


If you set StoreRates to false, in Graphite you can apply the derivative function to the ever-increasing counter to get the increase per retention interval, which would match your requirement.

E.g. in your example of 1 failed login followed by 2 more, you saw the values 1 and 3. Starting from 0, the derivative is 1 and then 2: the number of failed logins in each interval that Graphite tracks.
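
As a rough illustration of what that function computes, here is a small Python sketch. It is not Graphite's code, only a model of the idea: each point becomes the difference from the previous point, and a point with no predecessor (or a missing value) comes out as None.

```python
def derivative(values):
    """Difference between each point and the one before it."""
    result = []
    previous = None
    for value in values:
        if value is None or previous is None:
            result.append(None)
        else:
            result.append(value - previous)
        previous = value
    return result


# Raw counter as stored with StoreRates false: 0, then 1, a quiet interval, then 3.
raw_counter = [0, 1, 1, 3]
print(derivative(raw_counter))  # [None, 1, 0, 2]
```

In Graphite itself you would wrap the metric in `derivative(...)` in the graph target. If the counter can reset (for example when collectd restarts), `nonNegativeDerivative(...)` is probably the better fit, since it skips the negative jump.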

Alec Henninger answered Sep 21 '22 01:09