
systemd HTTP health check

I have a service on Red Hat 7.1 that I control with systemctl start, stop, restart and status. On one occasion systemctl status reported the service as active, but the application "behind" the service responded with an HTTP status code other than 200.

I know that I can use Monit or Nagios to check this and do the systemctl restart, but I would like to know whether systemd offers something by default, so that I do not need to install other tools.

My preferred solution would be to have my service restarted fully automatically whenever the HTTP return code is anything other than 200, using no tools other than systemd itself (and maybe with the option to notify a HipChat room or send an email...).

I've tried googling the topic - without luck. Please help :-)

clausfod asked Sep 24 '16 17:09


1 Answer

The Short Answer

systemd has a native (socket-based) healthcheck method, but it's not HTTP-based. You can write a shim that polls status over HTTP and forwards it to the native mechanism, however.


The Long Answer

The Right Thing in the systemd world is to use the sd_notify socket mechanism to inform the init system when your application is fully available. Use Type=notify for your service to enable this functionality.

You can write to this socket with the sd_notify() library call, or you can read the NOTIFY_SOCKET environment variable to get the socket's path and have your own code write READY=1 to it once the application is returning 200s.
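
As a rough illustration of the second option, here is a minimal Python sketch that writes to the notification socket by hand. The function name sd_notify is borrowed from the C call for readability; it is a hypothetical helper, not a binding to libsystemd, and it assumes the service runs on Linux with Type=notify so that NOTIFY_SOCKET is set.

```python
import os
import socket


def sd_notify(message: str) -> bool:
    """Send one datagram (e.g. "READY=1") to the socket named in NOTIFY_SOCKET.

    Returns False when NOTIFY_SOCKET is unset, which is what happens when the
    program is started by hand rather than by systemd.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" marks a Linux abstract-namespace socket, which is
        # addressed with a leading NUL byte instead.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), path)
    return True
```

The application would call sd_notify("READY=1") at the point where it knows it can serve 200s. Outside systemd the call simply returns False, so the same code can run unchanged during development.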

If you want to delegate this to a separate process that polls your application over HTTP and then writes to the socket, you can do that; just ensure that NotifyAccess is set appropriately (by default, only the main process of the service is allowed to write to the socket).
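
A separate poller of that kind might look roughly like the sketch below. The names notify, is_healthy and wait_and_signal_ready, the polling interval and the attempt limit are all assumptions made for illustration. It also assumes the unit uses NotifyAccess=all, since the sender here is not the service's main process.

```python
import http.client
import os
import socket
import time
import urllib.error
import urllib.request


def notify(message: str) -> bool:
    """Write one datagram to the socket named in NOTIFY_SOCKET, if any."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        path = "\0" + path[1:]  # abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), path)
    return True


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """True only if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers 4xx/5xx responses, refused connections and timeouts.
        return False


def wait_and_signal_ready(url, interval=1.0, attempts=60, probe=is_healthy):
    """Poll until the probe succeeds, then send READY=1.

    Returns True if readiness was signalled, False if attempts ran out.
    """
    for _ in range(attempts):
        if probe(url):
            notify("READY=1")
            return True
        time.sleep(interval)
    return False
```

The probe is passed in as a parameter so that the polling logic can be exercised without a running HTTP server. In a real unit the URL would be whatever endpoint your application exposes.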


If you also want to detect cases where the application fails after it was fully initialized, and to trigger a restart, the sd_notify socket is appropriate for that as well:

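Set WatchdogSec=... in the unit file to define how much time may pass between successful tests (systemd exposes the value to the service as the WATCHDOG_USEC environment variable, and newer releases also accept WATCHDOG_USEC=... over the socket to change it at runtime), then send WATCHDOG=1 whenever a self-test succeeds. If no successful test is seen for the configured period, systemd treats the service as failed and restarts it, provided its Restart= setting covers watchdog timeouts (for example Restart=on-watchdog or Restart=on-failure).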
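
The keep-alive side could be sketched as follows, again in Python and again with hypothetical names (notify, is_healthy, ping_interval, run_watchdog). It assumes the unit configures WatchdogSec= so that systemd exports WATCHDOG_USEC to the service, and it pings at half that period, a common convention rather than a requirement.

```python
import http.client
import os
import socket
import time
import urllib.error
import urllib.request


def notify(message: str) -> bool:
    """Write one datagram to the socket named in NOTIFY_SOCKET, if any."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        path = "\0" + path[1:]  # abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), path)
    return True


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """True only if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def ping_interval(default: float = 10.0) -> float:
    """Half of WATCHDOG_USEC, in seconds; the default if it is unset or invalid."""
    try:
        return int(os.environ["WATCHDOG_USEC"]) / 1_000_000 / 2
    except (KeyError, ValueError):
        return default


def run_watchdog(url, iterations=None, probe=is_healthy, sleep=time.sleep):
    """Send WATCHDOG=1 after every successful check; stay silent on failure.

    iterations=None loops forever; a number limits the loop (useful in tests).
    """
    interval = ping_interval()
    count = 0
    while iterations is None or count < iterations:
        if probe(url):
            notify("WATCHDOG=1")
        sleep(interval)
        count += 1
```

Staying silent on a failed check is the point of the design: the loop never restarts anything itself, it only withholds the keep-alive and leaves the restart decision to systemd.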

Charles Duffy answered Sep 18 '22 08:09