I'm considering exporting some metrics to Prometheus, and I'm getting nervous about what I'm planning to do.
My system consists of a workflow engine, and I'd like to track some metrics for each step in the workflow. This seems reasonable, with a gauge metric called wfengine_step_duration_seconds. My issue is that there are many thousands of steps across all my workflows.
According to the documentation here, I'm not supposed to programmatically generate any part of the name. That precludes, then, the use of names such as wfengine_step1_duration_seconds and wfengine_step2_duration_seconds, because the step names are programmatic (they change from time to time).
The solution, then, is a label for the step names. This also presents a problem, though, because the documentation here and here cautions quite strongly against using labels with high cardinality. Specifically, they recommend keeping "the cardinality of your metrics below 10", and for cardinality over 100, "investigate alternate solutions such as reducing the number of dimensions or moving the analysis away from monitoring".
I'm looking at a number of label values in the low thousands (1,000 to 10,000). Given that the number of metrics otherwise won't be extremely large, is this an appropriate usage of Prometheus, or should I limit myself to more generic metrics, such as a single aggregated step duration instead of individual duration for each step?
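For concreteness, here's a rough sketch of the two options I'm weighing, using the Python prometheus_client library (the step names and the aggregate metric name are made up for illustration):

    from prometheus_client import Gauge, Histogram, start_http_server
    import time

    # Option 1: one metric, step name as a label.
    # Cardinality = number of distinct step names (thousands, in my case).
    step_duration = Gauge(
        "wfengine_step_duration_seconds",
        "Duration of the last run of each workflow step",
        ["step"],
    )

    # Option 2: a single aggregated metric with no step label.
    # Cardinality stays bounded, but per-step detail is lost.
    aggregate_duration = Histogram(
        "wfengine_step_duration_aggregate_seconds",
        "Duration of workflow steps, aggregated across all steps",
    )

    def run_step(step_name, fn):
        start = time.time()
        fn()
        elapsed = time.time() - start
        step_duration.labels(step=step_name).set(elapsed)  # one series per step
        aggregate_duration.observe(elapsed)                 # one series set total

    if __name__ == "__main__":
        start_http_server(8000)  # expose /metrics for scraping
        run_step("validate_order", lambda: time.sleep(0.1))

With the labeled version, every distinct step value becomes its own time series on every replica, which is exactly where the cardinality guidance applies.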
The basic definition of cardinality is the number of elements in a given set. In Prometheus, every distinct combination of label values produces its own time series, so label cardinality directly drives the memory, ingestion, and query costs of your monitoring system. For example, a step label with 5,000 values scraped from 10 engine replicas yields 50,000 time series for a single metric.
High cardinality refers to a field that can have many possible values. For an online shopping system, fields like userId, shoppingCartId, and orderId are often high-cardinality columns that can take hundreds of thousands of distinct values. Similarly, requestId might be in the millions.
One option is downsampling. With standalone Prometheus, this typically means reducing the sampling rate, and thus the fidelity, of a set of metrics. With Chronosphere, downsampling means choosing how, and for how long, to store metrics.
The guideline of staying under 100 cardinality for your biggest metrics presumes that you have 1,000 replicas of your service, as that's a reasonably safe upper bound: 100 label values across 1,000 replicas is already 100,000 time series from a single metric. If you know that everyone using this code will always have a lower number of replicas, then there's scope for higher cardinality in instrumentation.
That said, thousands of label values is still something to be careful with. If it's thousands today, how long before it's tens or hundreds of thousands? Long term you'll likely have to move this data to logs given the cardinality, so you may wish to do so now.
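If you do move in that direction, a minimal sketch of the split looks something like this, keeping a single aggregated Prometheus metric and emitting per-step detail as structured log lines (the logger setup and JSON field names are illustrative, not a prescribed format):

    import json
    import logging
    import time

    from prometheus_client import Histogram

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = logging.getLogger("wfengine")

    # Low-cardinality metric: one histogram covering all steps combined.
    step_duration = Histogram(
        "wfengine_step_duration_seconds",
        "Duration of workflow steps, aggregated across all steps",
    )

    def run_step(step_name, fn):
        start = time.time()
        fn()
        elapsed = time.time() - start
        step_duration.observe(elapsed)  # cheap, bounded cardinality
        # Per-step detail goes to logs, where high cardinality is fine.
        log.info(json.dumps({
            "event": "step_duration",
            "step": step_name,
            "duration_seconds": round(elapsed, 6),
        }))

A log pipeline, or any tool built for high-cardinality event data, can then answer per-step questions, while Prometheus keeps the cheap aggregate view.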