Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

what are best practices for health checks?

We have a REST API. Right now our /health makes an smoke test on every dependency we have (a database and a couple microservices) and then returns 200 if there are no errors.

The problem is that not all dependencies are mandatory for our application to work. So while a problem accessing the database can be critical, problems accessing some microservices will only affect a small portion of our app.

On top of that we have Amazon ELB. It doesn't seem right to tag our app as unhealty only because one dependency is unhealty. ELB should only try to recover the unhealty dependency and with that our app will be healty again.

Which leads to the question: what should we actually check in our health-check? because it looks like we shouldn't be checking for any dependency at all. On the other hand, it's actually realy helpful to know the status of our app accessing all its dependencies (e.g for troubleshooting problems), so is it common to use some other endpoint for that purpose (say /sanity or /diagnostics)?

like image 909
César Daniel Pérez Avatar asked Oct 26 '17 18:10

César Daniel Pérez


1 Answers

Do not go overboard trying to check for every service, every dependency, etc. in your health check. Basically think of your health check as a Go / No Go test so that the load balancer knows if the service is running.

Load balancers will not recover failed instances. They will just take your service offline. Auto Scaling Groups can recover failed instances by creating new instances and terminating failed instances. CloudWatch can monitor your instances and report problems and cause events to happen (e.g. rebooting).

You can implement more comprehensive tests that run internal to your server and that chose a reporting / recovery path. Examples might include sending an SNS notification to your email or cell phone account, rebooting the server, etc.

Amazon has a number of services to help monitor, report and manage services. Look into CloudWatch for monitoring, SNS or SES for reporting, ASG for auto scaling, etc.

Think thru what type of fault tolerance, high availability and recovery strategy you need for your service. Then implement an approach that is simple enough so that the monitoring itself does not become a point of failure.

like image 57
John Hanley Avatar answered Sep 23 '22 17:09

John Hanley