I have a fresh installation of apache-airflow 1.8.2. I started its webserver, and its gunicorn workers exit on every webpage request, leaving the request hanging for around 30s while waiting for a new worker to spawn. I need help fixing this.
I've installed apache-airflow 1.8.2 and followed this guide. I started the webserver at port 8081.
Now when I visit the server in my browser, the response is very slow. Looking at the console output, I noticed that every time I load a webpage, it logs "Worker exiting", then pauses for a long time before logging "Booting worker".
After digging into the source code I found out that these are gunicorn workers. I have no experience with gunicorn, airflow, or Flask, so I don't know if this is the expected behavior, but I feel it shouldn't be: at the very least, a webserver should not hang for half a minute on every page request.
Console output:
---> Browser request
[2017-11-01 19:08:07 -0700] [14549] [INFO] Worker exiting (pid: 14549)
---> Hangs for 30s
[2017-11-01 19:08:37 -0700] [13316] [INFO] Handling signal: ttin
[2017-11-01 19:08:37 -0700] [14698] [INFO] Booting worker with pid: 14698
/Users/michael/Programs/clones/airflow/airflow/www/app.py:23: FlaskWTFDeprecationWarning: "flask_wtf.CsrfProtect" has been renamed to "CSRFProtect" and will be removed in 1.0.
csrf = CsrfProtect()
/Users/michael/Programs/miaozhen/tests/airflow-test/lib/python3.6/site-packages/flask/exthook.py:71: ExtDeprecationWarning: Importing flask.ext.cache is deprecated, use flask_cache instead.
.format(x=modname), ExtDeprecationWarning
127.0.0.1 - - [01/Nov/2017:19:08:37 -0700] "GET /admin/ HTTP/1.1" 200 95063 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
[2017-11-01 19:08:38,096] [14698] {models.py:168} INFO - Filling up the DagBag from /Users/michael/airflow/dags
---> other GET requests on the same webpage, skipped here for simplicity
[2017-11-01 19:08:39 -0700] [13316] [INFO] Handling signal: ttou
Now I'm running a source version of apache-airflow 1.8.2 (i.e. I cloned the source, checked out the tag, and installed with pip install -e .) in a virtualenv. However, I've also tried running the PyPI version (pip install apache-airflow) without a virtualenv, and running the source version without a virtualenv. The same problem exists for all installations, so these differences seem irrelevant.
My Python installation is:
$ python -VV
Python 3.6.3 (default, Oct 4 2017, 06:09:38)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.37)]
EDIT:
I tried installing & running apache-airflow on another machine (Ubuntu Linux 16.04 + Python 3.5), and there is no problem. I also asked another person on a Mac with Python 3.6, and they have no problem either. I guess there's something weird with my machine... Any suggestions for how I can debug this?
Gunicorn is based on the pre-fork worker model. This means that there is a central master process that manages a set of worker processes. The master never knows anything about individual clients. All requests and responses are handled completely by worker processes.
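The pre-fork model described above can be sketched in a few lines of Python. This is a minimal illustration using os.fork on POSIX, not gunicorn's actual code; the names prefork and handle are made up for the example:

```python
import os

def prefork(num_workers, handle):
    """Minimal pre-fork sketch: the master forks workers up front,
    workers do all request handling, the master only manages pids."""
    pids = []
    for i in range(num_workers):
        pid = os.fork()
        if pid == 0:
            # Worker process: handles requests; the master never sees clients.
            handle(i)
            os._exit(0)
        # Master process: records the worker's pid and keeps forking.
        pids.append(pid)
    # Master waits for workers to exit (a real master would respawn them).
    for pid in pids:
        os.waitpid(pid, 0)
    return pids

if __name__ == "__main__":
    worker_pids = prefork(3, lambda i: None)
    print(len(worker_pids))
```

In gunicorn's real master loop the workers serve HTTP from a shared listening socket, and the master respawns any worker that dies; the sketch only shows the fork/wait skeleton.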
If you are running behind Nginx, increase its proxy timeout (e.g. proxy_read_timeout 300s;) and restart the Nginx server. If that doesn't fix it, increase the timeout flag in your Gunicorn configuration; the default Gunicorn timeout is 30 seconds (e.g. --timeout 90). See the Gunicorn documentation about timeout.
After 30 seconds of request processing (configurable with timeout), the gunicorn master process sends SIGTERM to the worker process to initiate a graceful restart. If the worker does not shut down within another 30 seconds (configurable with graceful_timeout), the master process sends SIGKILL.
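That TERM-then-KILL sequence can be sketched as follows. This is a simplified illustration of the idea, not gunicorn's real implementation; stop_worker is a hypothetical helper name:

```python
import os
import signal
import time

def stop_worker(pid, graceful_timeout=30.0, poll=0.05):
    """Sketch of a graceful-stop sequence: SIGTERM first, then
    SIGKILL if the worker outlives graceful_timeout seconds."""
    os.kill(pid, signal.SIGTERM)          # ask the worker to shut down
    deadline = time.monotonic() + graceful_timeout
    while time.monotonic() < deadline:
        done, _status = os.waitpid(pid, os.WNOHANG)
        if done == pid:                   # worker exited in time
            return "graceful"
        time.sleep(poll)
    os.kill(pid, signal.SIGKILL)          # worker ignored us: force it
    os.waitpid(pid, 0)
    return "killed"
```

A worker that honors SIGTERM exits on the "graceful" path; one that ignores it (e.g. stuck in a long request) is reaped via SIGKILL after the grace period.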
Workers regularly exiting on the ttou signal (which means "decrement the number of processes by one") is intentional: it is Airflow periodically "refreshing" its workers. Based on what I read in AIRFLOW-276, which added this feature, refreshing workers ensures they pick up new or updated DAGs. This behavior can be tuned in your airflow config via worker_refresh_interval and worker_refresh_batch_size.
From looking at the source, Airflow spins up new workers before spinning down old ones, so I don't think the refresh itself should delay your requests. However, you can try disabling it by setting worker_refresh_batch_size = 0.
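For reference, both settings live in the [webserver] section of airflow.cfg. A sketch of the relevant fragment (the values shown are illustrative):

```ini
[webserver]
# Seconds between worker refresh batches
worker_refresh_interval = 30
# Number of workers to refresh at a time; 0 disables the periodic refresh
worker_refresh_batch_size = 0
```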