Supervisor exception on log rollover causes app server to freeze?

Question

I am running a flask app with gunicorn on an EC2 server. I use supervisord to monitor and restart the app server. Yesterday, the server was not responding to http requests. We looked at the status using supervisorctl, and it showed up as running. We looked at the supervisor logs and saw the following error:

CRIT uncaptured python exception, closing channel <POutputDispatcher at 34738328
for <Subprocess at 34314576 with name flask in state RUNNING> (stdout)>
(<type 'exceptions.OSError'>:[Errno 2] No such file or directory

[/usr/local/lib/python2.7/dist-packages/supervisor/supervisord.py|runforever|233] 
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|handle_read_event|231] 
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|record_output|165] 
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|_log|141]
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|info|273] 
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|log|291] 
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|emit|186]
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|doRollover|220])

Restarting supervisord fixed the issue for us. Below are the relevant parts of our supervisor config:

[supervisord]
childlogdir = /var/log/supervisord/
logfile = /var/log/supervisord/supervisord.log
logfile_maxbytes = 50MB
logfile_backups = 10
loglevel = info
pidfile = /var/log/supervisord/supervisord.pid
umask = 022
nodaemon = false
nocleanup = false

[program:flask]
directory=%(here)s
environment=PATH="/home/ubuntu/.virtualenvs/flask/bin"
command=newrelic-admin run-program gunicorn app:app -c gunicorn_conf.py
autostart=true
autorestart=true
redirect_stderr=true

What's strange is that we have 2 servers running behind an ELB and both of them had the same issue 10 mins from each other. I am guessing that the logs for both reached the limit around the same time (which is possible since they both see about the same amount of traffic) and the rollover failed. Any ideas as to why that could've happened?

Vinay Sajip · Accepted Answer

AFAIK supervisor uses its own logging implementation, not the one in the Python stdlib - although the class and method names are pretty similar.

There is a potential race condition when deleting files during rollover - you will need to check the source code of your specific supervisor version and compare that with the latest supervisor version, if different. Here is an excerpt from the supervisor code on my system (in the doRollover() method):

try:
    os.remove(dfn)
except OSError, why:
    # catch race condition (already deleted)
    if why[0] != errno.ENOENT:
        raise

If your rollover code doesn't do this, you might need to upgrade your supervisor version.

Update: If the error happens on the rename, then it might be a race condition which hasn't yet been caught. Consider asking on the supervisor mailing list.

Thiago Falcao · Answer

Within supervisor sections (example: [program:flask]), you need to set:

stdout_logfile_maxbytes=0
stderr_logfile_maxbytes=0

Supervisor exception on log rollover causes app server to freeze?

Tags:

python

logging

flask

gunicorn

supervisord

tapan

2 Answers

Vinay Sajip

Thiago Falcao

Recent Activity

Donate For Us

Supervisor exception on log rollover causes app server to freeze?

Tags:

python

logging

flask

gunicorn

supervisord

tapan

2 Answers

Vinay Sajip

Thiago Falcao

Related questions

Recent Activity

Donate For Us