
How to alert on the Kubernetes Cluster health?

We are running a hosted Kubernetes cluster on Google Cloud (GKE) and scraping it with Prometheus.

My question is similar to this one, but I'd like to know which metrics in the K8s cluster are the most important to watch and possibly alert on.

This is more of a K8s than a Prometheus question, but I'd really appreciate some hints. Please let me know if my question is too vague so I can refine it.

tex asked Sep 07 '16 09:09





1 Answer

etcd is the foundation of Kubernetes, so having a good set of alerts for it is important. We wrote this blog post on creating alerting rules for it and provided a base set at the end.

Further sources of important metrics in the Prometheus format are the Kubelet and cAdvisor, the API servers, and the fairly new kube-state-metrics. Unfortunately, I'm not aware of any public alerting rule sets for those like the one for etcd.

Generally, you want to ensure that the components themselves work flawlessly as applications, e.g.:

  • Are my kubelets/API servers running/reachable? (up metric)
  • Are their response latency and error rates within bounds?
  • Can the API servers reach etcd?
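
As a rough sketch, the first two checks above could be written as Prometheus alerting rules like the following. This uses the YAML rule-file format of current Prometheus releases. The group and alert names, the `job` label values (`kubelet`, `apiserver`), and the thresholds are all assumptions that depend on your scrape configuration. The `apiserver_request_total` counter is the name used by recent Kubernetes versions; older releases expose a differently named counter, so check what your API servers actually export.

```yaml
groups:
  - name: kubernetes-components   # hypothetical group name
    rules:
      # A kubelet or API server target cannot be scraped.
      # The job names are assumptions; match them to your scrape config.
      - alert: KubernetesComponentDown
        expr: up{job=~"kubelet|apiserver"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"

      # More than 1% of API server requests end in a 5xx response.
      # The 1% threshold is an arbitrary starting point.
      - alert: APIServerErrorRateHigh
        expr: |
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            /
          sum(rate(apiserver_request_total[5m])) > 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API server 5xx error rate is above 1%"
```

The `for` clause keeps a short scrape hiccup from paging anyone; the alert only fires once the condition has held for the whole duration.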

Then there's the Kubernetes business logic aspect, e.g.:

  • Are there pods that have been stuck in a non-ready/crashloop state for a long time?
  • Do I have enough CPU/memory capacity in my cluster?
  • Are my deployment replica expectations fulfilled?
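
Two of these questions map fairly directly onto kube-state-metrics. The rules below are a minimal sketch, assuming kube-state-metrics is being scraped and exposes `kube_pod_status_ready`, `kube_deployment_spec_replicas`, and `kube_deployment_status_replicas_available` (metric names can differ between kube-state-metrics versions). The group and alert names and the 15 minute windows are assumptions to tune for your cluster.

```yaml
groups:
  - name: kubernetes-workloads   # hypothetical group name
    rules:
      # A pod has reported Ready=false for 15 minutes.
      # Note: pods that have run to completion may also match and
      # might need to be filtered out in your setup.
      - alert: PodNotReady
        expr: kube_pod_status_ready{condition="false"} == 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not ready"

      # The number of available replicas differs from the desired count.
      - alert: DeploymentReplicasMismatch
        expr: |
          kube_deployment_spec_replicas
            !=
          kube_deployment_status_replicas_available
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} does not have its expected replicas"
```

A capacity rule would follow the same pattern, comparing requested resources against what the nodes can allocate, but the relevant metric names vary enough between versions that it is left out here.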

That's not a drop-in solution, unfortunately, but writing alerting rules that roughly cover the scope of the examples above should get you quite far.

fabxc answered Sep 26 '22 02:09
