I am laying out an architecture where we will be using statsd and graphite. I understand how graphite works and how a single statsd server could communicate with it. I am wondering how the architecture and setup would work for scaling out statsd servers. Would you have multiple node statsd servers and then one central statsd server pushing to graphite? I couldn't find anything about scaling out statsd, and any ideas on how to run multiple statsd servers would be appreciated.
StatsD is built to “Measure Anything, Measure Everything”. Instead of TCP, StatsD uses UDP, which provides the desired speed with as little overhead as possible.
For a gauge, the client sends:
- stat: the name of the gauge to set.
- value: the current value of the gauge.
- rate: a sample rate, a float between 0 and 1. Data will only be sent this fraction of the time. Note that the statsd server does not take the sample rate into account for gauges.
How the StatsD protocol works: the client library formats and encapsulates the metrics in a UDP packet and sends them to a StatsD server. The server collects and aggregates all the metrics and periodically submits them to a monitoring backend (or multiple backends).
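To make the wire format concrete, here is a minimal sketch in TypeScript (Node.js); the host, port, and metric names are placeholders, not anything from the question:

```typescript
// Minimal sketch of sending StatsD metrics over UDP with Node's dgram module.
// The wire format is plain text: "<metric.name>:<value>|<type>".
import * as dgram from "dgram";

const socket = dgram.createSocket("udp4");

function send(metric: string): void {
  const buf = Buffer.from(metric);
  // Fire-and-forget: UDP gives no delivery guarantee, which is the tradeoff
  // StatsD accepts in exchange for near-zero overhead on the sending side.
  socket.send(buf, 0, buf.length, 8125, "statsd.example.com");
}

send("page.views:1|c");        // counter: increment by 1
send("cpu.temperature:42|g");  // gauge: set the current value to 42
send("db.query.time:320|ms");  // timer: one 320 ms duration sample
```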
I'm dealing with the same problem right now. Doing naive load-balancing between multiple statsds obviously doesn't work because keys with the same name would end up in different statsds and would thus be aggregated incorrectly.
But there are a couple of options for using statsd in an environment that needs to scale:
Use client-side sampling for counter metrics, as described in the statsd documentation (i.e. instead of sending every event to statsd, send only every 10th event and have statsd multiply it by 10). The downside is that you need to manually set an appropriate sampling rate for each of your metrics. If you sample too few values, your results will be inaccurate; if you sample too many, you'll kill your (single) statsd instance.
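As a hedged sketch of what that looks like on the client (reusing the hypothetical send() helper from the sketch above; the metric name and rate are made up):

```typescript
// Client-side sampling for a counter. The "|@rate" suffix tells the statsd
// server to scale the observed count back up (e.g. multiply by 10 for 0.1).
function incrementSampled(stat: string, rate: number): void {
  if (Math.random() < rate) {
    // Only this fraction of events actually hits the network.
    send(`${stat}:1|c|@${rate}`);
  }
}

// Send roughly every 10th event; statsd multiplies the count by 10 on flush.
incrementSampled("requests.handled", 0.1);
```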
Build a custom load-balancer that shards by metric name to different statsds, thus circumventing the problem of broken aggregation. Each of those could write directly to Graphite.
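For illustration, a rough sketch of such a sharding proxy; the shard addresses are invented, and a real implementation would likely want consistent hashing so that adding a statsd instance doesn't reshuffle every key:

```typescript
// UDP proxy that shards by metric name: packets for a given key always land
// on the same statsd instance, so aggregation stays correct. Assumes one
// metric per datagram for simplicity.
import * as dgram from "dgram";
import { createHash } from "crypto";

const shards = [
  { host: "10.0.0.1", port: 8125 },
  { host: "10.0.0.2", port: 8125 },
];

const out = dgram.createSocket("udp4");
const server = dgram.createSocket("udp4");

server.on("message", (msg) => {
  // A datagram looks like "metric.name:value|type"; shard on the name only.
  const name = msg.toString().split(":")[0];
  const hash = createHash("md5").update(name).digest();
  const shard = shards[hash.readUInt32BE(0) % shards.length];
  out.send(msg, 0, msg.length, shard.port, shard.host);
});

server.bind(8125);
```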
Build a statsd client that counts events locally and only sends them in aggregate to statsd. This greatly reduces the traffic going to statsd and also makes it constant (as long as you don't add more servers). As long as the period with which you send the data to statsd is much smaller than statsd's own flush period, you should also get similarly accurate results.
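A minimal sketch of that idea for counters, again using the hypothetical send() helper; the one-second flush period is an assumption and only makes sense if it stays well below statsd's own flush interval:

```typescript
// Pre-aggregate counters locally and flush the totals to statsd once per
// second: one packet per key per interval instead of one packet per event.
const counters = new Map<string, number>();

function increment(stat: string, delta = 1): void {
  counters.set(stat, (counters.get(stat) ?? 0) + delta);
}

setInterval(() => {
  for (const [stat, count] of counters) {
    send(`${stat}:${count}|c`);
  }
  counters.clear();
}, 1000);
```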
A variation of the last point, which I have implemented with great success in production: use a first layer of multiple (in my case local) statsds, which in turn all aggregate into one central statsd, which then talks to Graphite. The first layer of statsds needs a much smaller flush time than the second. To do this, you will need a statsd-to-statsd backend. Since I faced exactly this problem, I wrote one that tries to be as network-efficient as possible: https://github.com/juliusv/ne-statsd-backend
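To make the layering concrete, here is a hedged sketch of the two configs; flushInterval, backends, graphiteHost, and graphitePort are real statsd options, but the exact values are illustrative, and the option names the forwarding backend expects are assumptions (check the ne-statsd-backend README):

```typescript
// First-layer (local) statsd: short flush period, forwards its aggregates
// upward via a statsd-to-statsd backend such as ne-statsd-backend.
const localConfig = {
  port: 8125,
  flushInterval: 1000, // flush often, e.g. every second
  backends: ["ne-statsd-backend"],
  // ...plus whatever target host/port options that backend expects.
};

// Central statsd: a much longer flush period, writing to Graphite.
const centralConfig = {
  port: 8125,
  flushInterval: 10000, // must be much larger than the first layer's flush
  backends: ["./backends/graphite"],
  graphiteHost: "graphite.example.com",
  graphitePort: 2003,
};
```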
As it is, statsd was unfortunately not designed to scale in a manageable way (no, I don't see adjusting sampling rates manually as "manageable"). But the workarounds above should help if you are stuck with it.