 

CRITICAL WORKER TIMEOUT on gunicorn when deployed to AWS

I have a Flask web app served by gunicorn, and I use the gevent worker class because that previously stopped me from getting [CRITICAL] WORKER TIMEOUT errors. Since deploying it to AWS behind an ELB, however, the issue has come back.

I have also tried the eventlet worker class before; that didn't work, but gevent did work locally.

This is the shell script that I have used as an entrypoint for my Dockerfile:

gunicorn -b 0.0.0.0:5000 --worker-class=gevent --worker-connections 1000 --timeout 60 --keep-alive 20 dataclone_controller:app
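For reference, the entrypoint script itself is just a thin wrapper around this command; a minimal sketch (the file name entrypoint.sh is an assumption, and the exec is there so gunicorn runs as PID 1 and receives container signals directly):

    #!/bin/sh
    # exec replaces the shell, so SIGTERM from the container runtime reaches gunicorn itself
    exec gunicorn -b 0.0.0.0:5000 --worker-class=gevent --worker-connections 1000 \
        --timeout 60 --keep-alive 20 dataclone_controller:app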

When I check the logs on the pods, this is the only information that gets printed out:

    [2019-09-04 11:36:12 +0000] [8] [INFO] Starting gunicorn 19.9.0
    [2019-09-04 11:36:12 +0000] [8] [INFO] Listening at: http://0.0.0.0:5000 (8)
    [2019-09-04 11:36:12 +0000] [8] [INFO] Using worker: gevent
    [2019-09-04 11:36:12 +0000] [11] [INFO] Booting worker with pid: 11
    [2019-09-04 11:38:15 +0000] [8] [CRITICAL] WORKER TIMEOUT (pid:11)
asked Sep 04 '19 by siddharth.nair

People also ask

Why does a gunicorn worker time out?

WORKER TIMEOUT means your application cannot respond to a request within the configured amount of time. You can set this limit using gunicorn's timeout settings. Some applications need more time to respond than others.
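For example, the timeout can be raised either on the command line or in a config file; a minimal sketch (the 120-second value and the myapp:app module path are only illustrations):

    # on the command line
    gunicorn --timeout 120 myapp:app

    # or in gunicorn.conf.py, loaded with: gunicorn -c gunicorn.conf.py myapp:app
    timeout = 120
    graceful_timeout = 30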

What is gunicorn's default timeout?

By default, gunicorn gracefully restarts a worker if it hasn't completed any work within the last 30 seconds. If you expect your application to respond quickly to a constant incoming flow of requests, try experimenting with a lower timeout configuration.
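To make this concrete, any handler that blocks for longer than the timeout is enough to trigger the restart under the default sync worker; a contrived Flask sketch (the route is made up):

    import time
    from flask import Flask

    app = Flask(__name__)

    @app.route("/slow")
    def slow():
        # With the default sync worker and the default --timeout of 30 seconds,
        # this request never completes: the master kills the worker and logs
        # [CRITICAL] WORKER TIMEOUT, exactly as in the output above.
        time.sleep(45)
        return "done"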

Where does gunicorn log by default?

Gunicorn writes its error log to stderr by default. The errorlog setting controls the error log file to write to; using '-' for FILE makes gunicorn log to stderr (the default since version 19.2).
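When the default output is as sparse as the log above, raising the log level and capturing the application's own stdout/stderr into the same stream often reveals what the worker was doing when it was killed; for example (all standard gunicorn flags):

    gunicorn --error-logfile - --access-logfile - --log-level debug --capture-output \
        dataclone_controller:app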

How many workers should gunicorn have?

Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second. Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with.
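That rule of thumb is usually applied with something like this (a sketch; whether the result suits a given workload still needs to be measured):

    # gunicorn.conf.py
    import multiprocessing

    # (2 x $num_cores) + 1, the starting point recommended by the gunicorn docs
    workers = multiprocessing.cpu_count() * 2 + 1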


1 Answer

For our Django application, we eventually tracked this down to memory exhaustion. This is difficult to diagnose because the AWS monitoring does not provide memory statistics (at least by default), and even if it did, it's not clear how easy a transient spike would be to spot (a crude way to watch for this is sketched after the list of symptoms below).

Additional symptoms included:

  • We would often lose network connectivity to the VM at this point.
  • /var/log/syslog contained some evidence of some processes restarting (in our case, this was mostly Hashicorp's Consul).
  • There was no evidence of the Linux OOM detection coming into play.
  • We knew the system was busy because the AWS CPU stats would often show a spike (to, say, 60%).
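Because the console gave us no memory numbers, one crude way to make a transient spike visible is to have each worker log its own resident memory periodically; a rough sketch using psutil, purely as an illustration:

    import logging
    import threading
    import time

    import psutil  # third-party: pip install psutil

    def log_memory_every(seconds=30):
        # Periodically log this process's resident memory so spikes
        # show up in the gunicorn error log alongside the timeouts.
        proc = psutil.Process()  # current process, i.e. the worker

        def _loop():
            while True:
                rss_mb = proc.memory_info().rss / (1024 * 1024)
                logging.warning("worker rss: %.1f MiB", rss_mb)
                time.sleep(seconds)

        threading.Thread(target=_loop, daemon=True).start()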

The fix for us lay in judicious conversion of Django queries which looked like this:

    for item in qs:
        do_something()

to use .iterator() like this:

    CHUNK_SIZE = 5
    ...
    for item in qs.iterator(CHUNK_SIZE):
        do_something()

which effectively trades database round-trips for lower memory usage. Note that CHUNK_SIZE = 5 made sense because we were fetching some database objects with big columns of JSONB. I expect that more typical usage might use a number several orders of magnitude larger.
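Going one step further: if the loop does not actually need the big JSONB columns, deferring them keeps each chunk smaller still; a sketch with a made-up field name:

    # 'payload' stands in for the wide JSONB column; defer() only loads it if accessed
    for item in qs.defer('payload').iterator(CHUNK_SIZE):
        do_something()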

answered Sep 19 '22 by Shaheed Haque