We are running a hosted Kubernetes cluster on Google Cloud (GKE) and scraping it with Prometheus.
My question is similar to this one, but I'd like to know which metrics are the most important to watch in the K8s cluster and possibly alert on.
This is more of a K8s than a Prometheus question, but I'd really appreciate some hints. Please let me know if my question is too vague, so I can refine it.
The most straightforward way to monitor your Kubernetes cluster is a combination of Heapster to collect metrics, InfluxDB to store them as time series, and Grafana to aggregate and present the collected information. The Heapster Git project has the files needed to deploy this setup.
To check the status of a pod, run the kubectl get pod command and check the STATUS column; a healthy pod shows Running. The READY column shows how many of the pod's containers are ready, which tells you whether the pod can accept user traffic.
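The same check can be scripted instead of read off the table by eye. The following is only a minimal sketch, assuming the JSON shape printed by kubectl get pods -o json (items, metadata.name, status.phase, status.containerStatuses[].ready); the function name unhealthy_pods and the sample pods are made up for illustration.

```python
import json


def unhealthy_pods(pod_list):
    """Return (name, phase, "ready/total") for pods that are not Running
    or that have at least one container that is not ready."""
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        # containerStatuses may be absent, e.g. while a pod is still Pending.
        containers = status.get("containerStatuses", [])
        ready = sum(1 for c in containers if c.get("ready"))
        if phase != "Running" or ready < len(containers):
            problems.append((name, phase, f"{ready}/{len(containers)}"))
    return problems


# Hypothetical sample in the shape of `kubectl get pods -o json`.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "web-0"},
     "status": {"phase": "Running",
                "containerStatuses": [{"name": "web", "ready": true}]}},
    {"metadata": {"name": "api-1"},
     "status": {"phase": "Running",
                "containerStatuses": [{"name": "api", "ready": false}]}},
    {"metadata": {"name": "worker-2"},
     "status": {"phase": "Pending"}}
  ]
}
""")

if __name__ == "__main__":
    for name, phase, ready in unhealthy_pods(SAMPLE):
        print(f"{name}: phase={phase} ready={ready}")
```

To run it against a real cluster, the sample could be replaced with json.load(sys.stdin) and fed from kubectl get pods -o json. Note that this sketch also flags pods in the Succeeded phase (for example finished jobs), which you may want to exclude.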
etcd is the foundation of Kubernetes, so having a good set of alerts for it is important. We wrote this blog post on creating alerting rules for it and provided a base set at the end.
Further sources of important metrics in the Prometheus format are the Kubelet and cAdvisor, the API servers, and the fairly new kube-state-metrics. Unfortunately, I'm not aware of any public alerting rule sets for those as there are for etcd.
Generally, you want to ensure that the components work flawlessly as applications, e.g. that they are running at all (the up metric).
Then there's the Kubernetes business logic aspect, i.e. the state of the objects the cluster manages.
That's not a drop-in solution, unfortunately, but writing alerting rules that roughly cover the scope of the above examples should get you quite far.
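As a starting point for the up metric mentioned above, here is a rough sketch of checking it through the Prometheus HTTP API (/api/v1/query) rather than through an alerting rule. The base URL, the job and instance labels in the sample, and the function names are assumptions for illustration; only the response layout (status, data.result[].metric, data.result[].value) follows the documented API format.

```python
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(response):
    """Return (job, instance) for every sample of `up` whose value is 0."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    down = []
    for sample in response["data"]["result"]:
        # An instant vector sample is [unix_timestamp, "value as string"].
        _, value = sample["value"]
        if float(value) == 0:
            labels = sample["metric"]
            down.append((labels.get("job", ""), labels.get("instance", "")))
    return down


# Hypothetical response to the query `up`; labels are made up.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "kubelet", "instance": "node-a:10250"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "apiserver", "instance": "10.0.0.1:443"},
             "value": [1700000000.0, "0"]},
        ],
    },
}

if __name__ == "__main__":
    # Assumed address; adjust to wherever your Prometheus is reachable.
    print(build_query_url("http://localhost:9090", "up == 0"))
    for job, instance in down_targets(SAMPLE_RESPONSE):
        print(f"target down: job={job} instance={instance}")
```

In a real setup the same expression (up == 0, usually held for a few minutes) would more naturally live in a Prometheus alerting rule; the script is only meant to show what such a rule evaluates.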