I'm considering exporting some metrics to Prometheus, and I'm getting nervous about what I'm planning to do.
My system consists of a workflow engine, and I'd like to track some metrics for each step in the workflow. This seems reasonable, with a gauge metric called wfengine_step_duration_seconds. My issue is that there are many thousands of steps across all my workflows.
According to the documentation here, I'm not supposed to programmatically generate any part of the name. That precludes, then, the use of names such as wfengine_step1_duration_seconds and wfengine_step2_duration_seconds, because the step names are programmatic (they change from time to time).
The solution, then, is a label for the step names. This also presents a problem, though, because the documentation here and here cautions quite strongly against using labels with high cardinality. Specifically, they recommend keeping "the cardinality of your metrics below 10", and for cardinality over 100, "investigate alternate solutions such as reducing the number of dimensions or moving the analysis away from monitoring".
I'm looking at a number of label values in the low thousands (1,000 to 10,000). Given that the number of metrics otherwise won't be extremely large, is this an appropriate usage of Prometheus, or should I limit myself to more generic metrics, such as a single aggregated step duration instead of individual duration for each step?
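For concreteness, here's a rough sketch of the two options I'm weighing, using the Python prometheus_client library (the step names and the aggregate metric name are made up for illustration):

    from prometheus_client import Gauge, Histogram, start_http_server
    import time

    # Option 1: one metric, step name as a label.
    # Cardinality = number of distinct step names (thousands, in my case).
    step_duration = Gauge(
        "wfengine_step_duration_seconds",
        "Duration of the last run of each workflow step",
        ["step"],
    )

    # Option 2: a single aggregated metric with no step label.
    # Cardinality stays bounded, but per-step detail is lost.
    aggregate_duration = Histogram(
        "wfengine_step_duration_aggregate_seconds",
        "Duration of workflow steps, aggregated across all steps",
    )

    def run_step(step_name, fn):
        start = time.time()
        fn()
        elapsed = time.time() - start
        step_duration.labels(step=step_name).set(elapsed)  # one series per step
        aggregate_duration.observe(elapsed)                 # one series set total

    if __name__ == "__main__":
        start_http_server(8000)  # expose /metrics for scraping
        run_step("validate_order", lambda: time.sleep(0.1))

With the labeled version, every distinct step value becomes its own time series on every replica, which is exactly where the cardinality guidance applies.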
The basic definition of cardinality is the number of elements in a given set. In Prometheus, every distinct combination of label values produces its own time series, so label cardinality directly drives the memory, ingestion, and query costs of your monitoring system. For example, a step label with 5,000 values scraped from 10 engine replicas yields 50,000 time series for a single metric.
High cardinality refers to a field that can have many possible values. For an online shopping system, fields like userId, shoppingCartId, and orderId are often high-cardinality columns that can take hundreds of thousands of distinct values. Similarly, requestId might be in the millions.
One option is downsampling. With standalone Prometheus, this typically means reducing the sampling rate, and thus the fidelity, of a set of metrics. With Chronosphere, downsampling means choosing how, and for how long, to store metrics.
The guideline of staying under 100 cardinality for your biggest metrics presumes that you have 1,000 replicas of your service, as that's a reasonably safe upper bound: 100 label values across 1,000 replicas is already 100,000 time series from a single metric. If you know that everyone using this code will always have a lower number of replicas, then there's scope for higher cardinality in instrumentation.
That said, thousands of label values is still something to be careful with. If it's thousands today, how long before it's tens or hundreds of thousands? Long term you'll likely have to move this data to logs given the cardinality, so you may wish to do so now.
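If you do move in that direction, a minimal sketch of the split looks something like this, keeping a single aggregated Prometheus metric and emitting per-step detail as structured log lines (the logger setup and JSON field names are illustrative, not a prescribed format):

    import json
    import logging
    import time

    from prometheus_client import Histogram

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = logging.getLogger("wfengine")

    # Low-cardinality metric: one histogram covering all steps combined.
    step_duration = Histogram(
        "wfengine_step_duration_seconds",
        "Duration of workflow steps, aggregated across all steps",
    )

    def run_step(step_name, fn):
        start = time.time()
        fn()
        elapsed = time.time() - start
        step_duration.observe(elapsed)  # cheap, bounded cardinality
        # Per-step detail goes to logs, where high cardinality is fine.
        log.info(json.dumps({
            "event": "step_duration",
            "step": step_name,
            "duration_seconds": round(elapsed, 6),
        }))

A log pipeline, or any tool built for high-cardinality event data, can then answer per-step questions, while Prometheus keeps the cheap aggregate view.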