 

Error when using group_left in Prometheus

Tags:

prometheus

I get an error when trying to use group_left to join two queries.

The query is:

floor(avg_over_time(dcgm_gpu_utilization{cluster_name="researchers"}[5m]) * on (instance) group_left(node) max by (node) (kube_node_labels{label_grid="true"}))

And it shows this error:

Error executing query: found duplicate series for the match group {} on the right hand-side of the operation: [{node="gpu-m-08"}, {node="gpu-l-03"}];many-to-many matching not allowed: matching labels must be unique on one side

Output of query one, floor(avg_over_time(dcgm_gpu_utilization{cluster_name="researchers"}[5m])):

{app="prometheus-node-exporter",chart="prometheus-node-exporter-1.3.0",cluster_name="researchers",gpu="0",heritage="Tiller",instance="172.21.4.101:9100",job="kubernetes-service-endpoints",kubernetes_name="prometheus-node-exporter",kubernetes_namespace="monitoring",release="prometheus-node-exporter",uuid="GPU-92e6ebf6-2b2d-c041-7f70-e16812c0ffa0"}

Output of query two, max by (node) (kube_node_labels{label_grid="true"}):

{node="gpu-m-08"}
{node="gpu-m-09"}
{node="gpu-m-12"}

I just want to see the node label in the output of the problematic query.

BTW this works (without the label_grid=true label):

floor(avg_over_time(dcgm_gpu_utilization{cluster_name="researchers"}[5m])  * on (instance) group_left(nodename) node_uname_info)

It adds the nodename label to the labels in the query output.

The main goal is to see only the metrics for nodes labeled label_grid="true", together with their node name.

Ziv Rechnitser asked May 23 '26 18:05


2 Answers

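The RHS has no instance label because max by (node) drops it, so every RHS series falls into the same empty match group {} and Prometheus tries to match all of them to each series on the LHS. Try max by (node, instance) (kube_node_labels{label_grid="true"})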
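
Applied to the query from the question, that suggestion would look roughly like the sketch below. It assumes that each kube_node_labels series carries its own instance value and that this value matches the instance label on dcgm_gpu_utilization; the question does not confirm either point. If several nodes share one instance value on the right-hand side, the duplicate-series error would persist.

```promql
floor(
  avg_over_time(dcgm_gpu_utilization{cluster_name="researchers"}[5m])
    * on (instance) group_left(node)
  # keep instance in the aggregation so each RHS series gets its own match group
  max by (node, instance) (kube_node_labels{label_grid="true"})
)
```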

brian-brazil answered May 26 '26 17:05


The group_left() modifier requires that the right-hand side of the * operator (or of any other binary operator) contain at most one time series for each set of label=value pairs listed in the on() modifier. Otherwise the query fails with the "found duplicate series for the match group" error. See these docs for more details.
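
One way to see which match groups are duplicated is to count the right-hand-side series per on() label. The following is a sketch that reuses the right-hand side from the question:

```promql
# Count RHS series per value of the on() label (instance).
# Any result means more than one RHS series shares that match group,
# so group_left cannot pick a unique series for it.
count by (instance) (
  max by (node) (kube_node_labels{label_grid="true"})
) > 1
```

Because max by (node) removes the instance label, all of these series are counted under the empty label set {}, which is the match group named in the error message.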

The solution is to list labels in the on() modifier such that each label=value set for those labels matches only a single time series on the right-hand side of the * operator. The instance label is a good candidate. The only issue is that dcgm_gpu_utilization and kube_node_labels are collected from different targets with different TCP port numbers, so they have different instance label values (see these docs explaining how the instance label is generated). As a result, no series on the two sides of the * operator match, so the following query returns nothing:

floor(avg_over_time(dcgm_gpu_utilization{cluster_name="researchers"}[5m]))
  * on (instance) group_left(node)
kube_node_labels{label_grid="true"}

This can be fixed by stripping the port number from the instance label on both sides of the * operator with the label_replace function:

label_replace(
  floor(avg_over_time(dcgm_gpu_utilization{cluster_name="researchers"}[5m])),
  "hostname",
  "$1",
  "instance",
  "([^:]+):.+"
)
  * on (hostname) group_left(node)
label_replace(
  kube_node_labels{label_grid="true"},
  "hostname",
  "$1",
  "instance",
  "([^:]+):.+"
)

This query extracts the hostname part from the instance label on each side, puts it into a hostname label, and then joins the left-hand side and right-hand side time series on that label.
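
Before relying on the join, it may help to check that the derived hostname label is unique on the right-hand side and that it lines up with the left-hand side. The two queries below are a sketch for that and are meant to be run separately. Whether they return anything depends on how the targets in this setup are scraped, which the question does not show. Note also that label_replace returns a series unchanged when the regex does not match, so instance values without a port get no hostname label.

```promql
# 1) RHS uniqueness: any result means the join would still fail with "duplicate series".
count by (hostname) (
  label_replace(kube_node_labels{label_grid="true"}, "hostname", "$1", "instance", "([^:]+):.+")
) > 1

# 2) Overlap: hostnames present on both sides; an empty result means the join returns nothing.
count by (hostname) (
  label_replace(dcgm_gpu_utilization{cluster_name="researchers"}, "hostname", "$1", "instance", "([^:]+):.+")
)
and
count by (hostname) (
  label_replace(kube_node_labels{label_grid="true"}, "hostname", "$1", "instance", "([^:]+):.+")
)
```

A result with the empty label set {} in either query indicates series whose instance value the regex did not match.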

valyala answered May 26 '26 16:05


