I am running a Flask app with Gunicorn on an EC2 server, and I use supervisord to monitor and restart the app server. Yesterday, the server stopped responding to HTTP requests. We checked the status with supervisorctl, and the process showed up as running. We then looked at the supervisor logs and saw the following error:
CRIT uncaptured python exception, closing channel <POutputDispatcher at 34738328
for <Subprocess at 34314576 with name flask in state RUNNING> (stdout)>
(<type 'exceptions.OSError'>:[Errno 2] No such file or directory
[/usr/local/lib/python2.7/dist-packages/supervisor/supervisord.py|runforever|233]
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|handle_read_event|231]
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|record_output|165]
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|_log|141]
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|info|273]
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|log|291]
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|emit|186]
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|doRollover|220])
Restarting supervisord fixed the issue for us. Below are the relevant parts of our supervisor config:
[supervisord]
childlogdir = /var/log/supervisord/
logfile = /var/log/supervisord/supervisord.log
logfile_maxbytes = 50MB
logfile_backups = 10
loglevel = info
pidfile = /var/log/supervisord/supervisord.pid
umask = 022
nodaemon = false
nocleanup = false
[program:flask]
directory=%(here)s
environment=PATH="/home/ubuntu/.virtualenvs/flask/bin"
command=newrelic-admin run-program gunicorn app:app -c gunicorn_conf.py
autostart=true
autorestart=true
redirect_stderr=true
What's strange is that we have two servers running behind an ELB, and both of them had the same issue within 10 minutes of each other. I am guessing that the logs on both reached the size limit at around the same time (which is plausible, since they see roughly the same amount of traffic) and the rollover failed. Any ideas as to why that could have happened?
AFAIK, supervisor uses its own logging implementation rather than the one in the Python standard library, although the class and method names are quite similar.
There is a potential race condition when deleting files during rollover. You will need to check the source code of your specific supervisor version and compare it with the latest version, if they differ. Here is an excerpt from the supervisor code on my system (the doRollover() method in loggers.py, which appears at the bottom of your traceback):
try:
    os.remove(dfn)
except OSError, why:
    # catch race condition (already deleted)
    if why[0] != errno.ENOENT:
        raise
If your rollover code doesn't do this, you might need to upgrade your supervisor version.
Update: if the error happens on the rename rather than the delete, then it may be a race condition that has not yet been guarded against. Consider asking on the supervisor mailing list.
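To make that concrete, here is a purely illustrative sketch (not supervisor's actual code) of how a rename during rollover could be guarded the same way the delete above is; the helper name is hypothetical:

import errno
import os

def rename_ignoring_missing(src, dst):
    # Hypothetical helper: rename the current log file to its backup name,
    # tolerating the case where another process already moved or removed it.
    try:
        os.rename(src, dst)
    except OSError as why:
        # Same pattern as the os.remove() guard above: ignore a missing
        # source file, re-raise any other error.
        if why.errno != errno.ENOENT:
            raise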
Within each supervisor program section (for example, [program:flask]), you need to set:
stdout_logfile_maxbytes=0
stderr_logfile_maxbytes=0
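A value of 0 tells supervisor not to rotate that child log file at all (unlimited size), which takes its rollover code out of the picture for those files. Applied to the [program:flask] section from the question, the result would look roughly like this (a sketch reusing the question's own settings):

[program:flask]
directory=%(here)s
environment=PATH="/home/ubuntu/.virtualenvs/flask/bin"
command=newrelic-admin run-program gunicorn app:app -c gunicorn_conf.py
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile_maxbytes=0
stderr_logfile_maxbytes=0

Note that with redirect_stderr=true the process's stderr already goes to the stdout log, so the stderr_* setting is mostly there for completeness.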