Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Supervisor exception on log rollover causes app server to freeze?

I am running a flask app with gunicorn on an EC2 server. I use supervisord to monitor and restart the app server. Yesterday, the server was not responding to http requests. We looked at the status using supervisorctl, and it showed up as running. We looked at the supervisor logs and saw the following error:

CRIT uncaptured python exception, closing channel <POutputDispatcher at 34738328
for <Subprocess at 34314576 with name flask in state RUNNING> (stdout)>
(<type 'exceptions.OSError'>:[Errno 2] No such file or directory

[/usr/local/lib/python2.7/dist-packages/supervisor/supervisord.py|runforever|233] 
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|handle_read_event|231] 
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|record_output|165] 
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|_log|141]
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|info|273] 
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|log|291] 
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|emit|186]
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|doRollover|220])

Restarting supervisord fixed the issue for us. Below are the relevant parts of our supervisor config:

[supervisord]
childlogdir = /var/log/supervisord/
logfile = /var/log/supervisord/supervisord.log
logfile_maxbytes = 50MB
logfile_backups = 10
loglevel = info
pidfile = /var/log/supervisord/supervisord.pid
umask = 022
nodaemon = false
nocleanup = false

[program:flask]
directory=%(here)s
environment=PATH="/home/ubuntu/.virtualenvs/flask/bin"
command=newrelic-admin run-program gunicorn app:app -c gunicorn_conf.py
autostart=true
autorestart=true
redirect_stderr=true

What's strange is that we have 2 servers running behind an ELB and both of them had the same issue 10 mins from each other. I am guessing that the logs for both reached the limit around the same time (which is possible since they both see about the same amount of traffic) and the rollover failed. Any ideas as to why that could've happened?

like image 782
tapan Avatar asked Dec 07 '25 02:12

tapan


2 Answers

AFAIK supervisor uses its own logging implementation, not the one in the Python stdlib - although the class and method names are pretty similar.

There is a potential race condition when deleting files during rollover - you will need to check the source code of your specific supervisor version and compare that with the latest supervisor version, if different. Here is an excerpt from the supervisor code on my system (in the doRollover() method):

try:
    os.remove(dfn)
except OSError, why:
    # catch race condition (already deleted)
    if why[0] != errno.ENOENT:
        raise

If your rollover code doesn't do this, you might need to upgrade your supervisor version.

Update: If the error happens on the rename, then it might be a race condition which hasn't yet been caught. Consider asking on the supervisor mailing list.

like image 196
Vinay Sajip Avatar answered Dec 08 '25 16:12

Vinay Sajip


Within supervisor sections (example: [program:flask]), you need to set:

stdout_logfile_maxbytes=0
stderr_logfile_maxbytes=0
like image 30
Thiago Falcao Avatar answered Dec 08 '25 15:12

Thiago Falcao



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!