We are running a hosted Kubernetes cluster on Google Cloud (GKE) and scraping it with Prometheus. My question is similar to this one, but I'd like to know which metrics in the K8s cluster are the most important to watch and possibly alert on. This is more of a K8s than a Prometheus question, but I'd really appreciate some hints. Please let me know if my question is too vague, so I can refine it.

How to alert on the Kubernetes Cluster health?

1 Answer

etcd is the foundation of Kubernetes, so having a good set of alerts for it is important. We wrote this blog post on creating alerting rules for it and provided a base set at the end.

Further sources of important metrics in the Prometheus format are the Kubelet and cAdvisor, the API servers, and the fairly new kube-state-metrics. For those, unfortunately, I'm not aware of any public alerting rule sets like the one for etcd.

Generally, you want to ensure that the components themselves work flawlessly as applications, e.g.:

Are my kubelets/API servers running/reachable? (up metric)
Are their response latency and error rates within bounds?
Can the API servers reach etcd?
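
As a rough illustration of the first check, here is a minimal Python sketch that reads the result of an instant query for the up metric from the Prometheus HTTP API and lists the targets that are down. The server address and the job names ("kubelet", "apiserver") are assumptions that depend on your scrape configuration. In practice you would more likely express this as a Prometheus alerting rule on up == 0.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def down_targets(response):
    """Return sorted (job, instance) pairs whose `up` sample is 0.

    `response` is the decoded JSON body of an instant query for `up`.
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % (response,))
    down = []
    for sample in response["data"]["result"]:
        _timestamp, value = sample["value"]  # the value arrives as a string
        if float(value) == 0:
            labels = sample["metric"]
            down.append((labels.get("job", ""), labels.get("instance", "")))
    return sorted(down)


def fetch(base_url, promql):
    """Run an instant query against a Prometheus server (needs network access)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return json.load(resp)
```

Assuming a server reachable at a hypothetical address such as http://prometheus:9090, down_targets(fetch("http://prometheus:9090", "up")) would return the unreachable kubelets and API servers, provided they are scraped as separate jobs.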

Then there's the Kubernetes business logic aspect, e.g.:

Are there pods that have been in non-ready/crashloop state forever?
Do I have enough CPU/memory capacity in my cluster?
Are my deployment replica expectations fulfilled?
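
The replica check can be sketched in the same spirit. The snippet below is only an illustration: it assumes the kube-state-metrics series kube_deployment_spec_replicas and kube_deployment_status_replicas_available, labelled with namespace and deployment (metric and label names may differ between kube-state-metrics versions), and compares the two instant-query results in plain Python.

```python
def index_by_deployment(result):
    """Map (namespace, deployment) -> sample value for one instant-vector result."""
    return {
        (s["metric"].get("namespace", ""), s["metric"].get("deployment", "")): float(s["value"][1])
        for s in result
    }


def replica_shortfalls(spec_result, available_result):
    """Return {(namespace, deployment): (wanted, available)} for under-replicated deployments.

    Both arguments are the `data.result` lists of instant queries for the
    (assumed) series kube_deployment_spec_replicas and
    kube_deployment_status_replicas_available.
    """
    spec = index_by_deployment(spec_result)
    available = index_by_deployment(available_result)
    shortfalls = {}
    for key, wanted in spec.items():
        have = available.get(key, 0.0)  # a missing series is treated as zero replicas
        if have < wanted:
            shortfalls[key] = (wanted, have)
    return shortfalls
```

A real alert should only fire if the mismatch persists, since rollouts make it appear briefly. In a Prometheus alerting rule that is what the for clause is meant for.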

That's no drop-in solution unfortunately, but writing alerting rules roughly covering the scope of the above examples should get you quite far.

answered Sep 26 '22 02:09

fabxc

Tags:

kubernetes

google-kubernetes-engine

prometheus
