I have a Flask web app served by Gunicorn. I switched to the gevent worker class because it previously helped me avoid [CRITICAL] WORKER TIMEOUT errors, but since deploying to AWS behind an ELB the issue has come back. I had also tried the eventlet worker class, which didn't help, whereas gevent worked locally.
This is the shell script I use as the entrypoint in my Dockerfile:
gunicorn -b 0.0.0.0:5000 --worker-class=gevent --worker-connections 1000 --timeout 60 --keep-alive 20 dataclone_controller:app
When I check the logs on the pods, this is the only information that gets printed out:
[2019-09-04 11:36:12 +0000] [8] [INFO] Starting gunicorn 19.9.0
[2019-09-04 11:36:12 +0000] [8] [INFO] Listening at: http://0.0.0.0:5000 (8)
[2019-09-04 11:36:12 +0000] [8] [INFO] Using worker: gevent
[2019-09-04 11:36:12 +0000] [11] [INFO] Booting worker with pid: 11
[2019-09-04 11:38:15 +0000] [8] [CRITICAL] WORKER TIMEOUT (pid:11)
WORKER TIMEOUT means your application could not respond to a request within the configured amount of time. You can control this with Gunicorn's timeout setting; some applications need more time to respond than others.
Worker timeouts: by default, Gunicorn gracefully restarts a worker if it hasn't completed any work within the last 30 seconds. If you expect your application to respond quickly to a constant incoming flow of requests, try experimenting with a lower timeout configuration.
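For example, here is a sketch of the poster's own command with a longer timeout; the value 120 is just an arbitrary illustrative choice, not a recommendation, and only helps if slow requests are actually the cause:

gunicorn -b 0.0.0.0:5000 --worker-class=gevent --worker-connections 1000 --timeout 120 --keep-alive 20 dataclone_controller:app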
errorlog: The error log file to write to. Using '-' for FILE makes Gunicorn log to stderr. Changed in version 19.2: log to stderr by default.
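To see what the worker is doing before it gets killed, you can also raise the log verbosity. --error-logfile and --log-level are standard Gunicorn flags; the debug level below is just an illustrative choice:

gunicorn -b 0.0.0.0:5000 --worker-class=gevent --error-logfile - --log-level debug --timeout 60 dataclone_controller:app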
Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second. Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with.
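As a sketch of that rule of thumb, a Gunicorn config file (the filename gunicorn.conf.py below is just an assumed choice, passed with -c) can compute the worker count at startup:

# gunicorn.conf.py - a sketch of the (2 x num_cores) + 1 rule of thumb
import multiprocessing

workers = (2 * multiprocessing.cpu_count()) + 1
worker_class = "gevent"
worker_connections = 1000
timeout = 60
keepalive = 20
errorlog = "-"  # write the error log to stderr (the default since 19.2)

Run it with: gunicorn -c gunicorn.conf.py dataclone_controller:app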
For our Django application, we eventually tracked this down to memory exhaustion. This is difficult to diagnose because the AWS monitoring does not provide memory statistics (at least by default), and even if it did, it's not clear how easy a transient spike would be to spot.
Additional symptoms included:
The fix for us lay in judicious conversion of Django queries which looked like this:
for item in qs:
    do_something()
to use .iterator() like this:
CHUNK_SIZE = 5
...
for item in qs.iterator(CHUNK_SIZE):
    do_something()
which effectively trades extra database round-trips for lower memory usage. Note that CHUNK_SIZE = 5 made sense because we were fetching database objects with big JSONB columns. I expect that more typical usage would use a number several orders of magnitude larger.
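For context, here is a minimal self-contained sketch of the same pattern; the app, model, and field names are hypothetical, and passing chunk_size as a keyword requires Django 2.0 or later:

# Sketch of the .iterator() pattern with a hypothetical model.
from myapp.models import Record  # hypothetical app and model

CHUNK_SIZE = 5  # small because each row carries a large JSONB payload

def process_all_records():
    qs = Record.objects.filter(processed=False)  # hypothetical filter
    # .iterator() streams rows from the database in chunks instead of
    # caching the entire result set on the queryset in memory.
    for item in qs.iterator(chunk_size=CHUNK_SIZE):
        do_something(item)  # do_something is the per-item work from above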