
How to get supervisord to restart hung workers?

I have a number of Python workers managed by supervisord that should continuously print to stdout (after each completed task) if they are working properly. However, they tend to hang, and we've had difficulty finding the bug. Ideally supervisord would notice that they haven't printed in X minutes and restart them; the tasks are idempotent, so non-graceful restarts are fine. Is there any supervisord feature or addon that can do this? Or another supervisor-like program that has this out of the box?

We are already using http://superlance.readthedocs.io/en/latest/memmon.html to kill if memory usage skyrockets, which mitigates some of the hangs, but a hang that doesn't cause a memory leak can still cause the workers to reach a standstill.

Asked by btown, Apr 27 '17 18:04




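1 Answer

One possible solution is to wrap your Python script in a Bash script that monitors its stdout and exits if nothing is printed for a set period of time. supervisord then sees the wrapper exit and restarts it, provided autorestart is enabled for the program.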

For example:

kill-if-hung.sh

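#!/usr/bin/env bash
# Run a command and exit if it prints nothing to stdout for TIMEOUT seconds.
set -e

TIMEOUT=60
LAST_CHANGED="$(date +%s)"

# Background ticker: signal the main shell once a second so it re-checks the clock.
{
    set -e
    while true; do
        sleep 1
        kill -USR1 $$
    done
} &

trap check_output USR1

check_output() {
    CURRENT="$(date +%s)"
    if [[ $((CURRENT - LAST_CHANGED)) -ge $TIMEOUT ]]; then
        echo "Process STDOUT hasn't printed in $TIMEOUT seconds"
        echo "Considering process hung and exiting"
        exit 1
    fi
}

# Named pipe that carries the wrapped command's stdout.
STDOUT_PIPE="$(mktemp -u)"
mkfifo "$STDOUT_PIPE"

trap cleanup EXIT
cleanup() {
    # Remove the FIFO first: the group-wide TERM below also reaches this shell.
    [[ -p "$STDOUT_PIPE" ]] && rm -f "$STDOUT_PIPE"
    kill -- -$$ # Send TERM to the process group, child processes included
}

# Quote "$@" so arguments containing spaces survive intact.
"$@" >"$STDOUT_PIPE" || exit 2 &

while true; do
    # IFS= and -r keep leading whitespace and backslashes in each line.
    if IFS= read -r tmp; then
        echo "$tmp"
        LAST_CHANGED="$(date +%s)"
    fi
done <"$STDOUT_PIPE"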

Then run the Python script under supervisord as: kill-if-hung.sh python -u some-script.py. The -u flag disables Python's output buffering so each line reaches the wrapper as soon as it is printed; setting PYTHONUNBUFFERED has the same effect.

The same kind of wrapper could also be written in Python.
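For what it's worth, here is a minimal sketch of what such a Python wrapper might look like. It is not part of the original answer and has only been reasoned through, not battle-tested. The function name run_with_watchdog and its timeout and poll_interval parameters are made up for illustration; only the standard library is used. It assumes the wrapped command writes complete lines to stdout and does not spawn children of its own (only the direct child is killed).

```python
import subprocess
import sys
import threading
import time


def run_with_watchdog(cmd, timeout=60.0, poll_interval=0.5):
    """Run cmd, echo its stdout, and kill it if it is silent for `timeout` seconds.

    Returns the child's exit code, or 1 if the child was killed as hung.
    (Hypothetical helper: the name and parameters are illustrative only.)
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    last_output = time.monotonic()

    def pump():
        nonlocal last_output
        # readline() blocks until a full line arrives; "" means the pipe closed.
        for line in iter(proc.stdout.readline, ""):
            sys.stdout.write(line)
            sys.stdout.flush()
            last_output = time.monotonic()

    # Read in a daemon thread so a silent child cannot block the timeout check.
    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    while proc.poll() is None:
        if time.monotonic() - last_output >= timeout:
            print(f"No stdout for {timeout} seconds; killing process",
                  file=sys.stderr)
            proc.kill()  # non-graceful, which the question says is acceptable
            proc.wait()
            return 1
        time.sleep(poll_interval)

    reader.join()  # drain any output still in the pipe
    return proc.returncode
```

A small entry script could call sys.exit(run_with_watchdog(sys.argv[1:])) and be used in the supervisord command the same way as kill-if-hung.sh. The wrapped worker still needs -u or PYTHONUNBUFFERED, otherwise its output sits in a buffer and the watchdog sees silence.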

Answered by Ben, Oct 03 '22 19:10