
Airflow scheduler stuck

I'm testing Airflow, and after triggering a (seemingly) large number of DAGs at the same time, the scheduler fails to schedule anything and starts killing processes. These are the logs it prints:

[2019-08-29 11:17:13,542] {scheduler_job.py:214} WARNING - Killing PID 199809
[2019-08-29 11:17:13,544] {scheduler_job.py:214} WARNING - Killing PID 199809
[2019-08-29 11:17:44,614] {scheduler_job.py:214} WARNING - Killing PID 2992
[2019-08-29 11:17:44,614] {scheduler_job.py:214} WARNING - Killing PID 2992
[2019-08-29 11:18:15,692] {scheduler_job.py:214} WARNING - Killing PID 5174
[2019-08-29 11:18:15,693] {scheduler_job.py:214} WARNING - Killing PID 5174
[2019-08-29 11:18:46,765] {scheduler_job.py:214} WARNING - Killing PID 22410
[2019-08-29 11:18:46,766] {scheduler_job.py:214} WARNING - Killing PID 22410
[2019-08-29 11:19:17,845] {scheduler_job.py:214} WARNING - Killing PID 42177
[2019-08-29 11:19:17,846] {scheduler_job.py:214} WARNING - Killing PID 42177
...

I'm using a LocalExecutor with a PostgreSQL backend DB. It seems to happen only after I trigger a large number (>100) of DAGs at about the same time using external triggering, as in:

airflow trigger_dag DAG_NAME
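
For anyone trying to reproduce this, the bulk triggering described above could look roughly like the sketch below. It only builds the command lines; the DAG ids are hypothetical placeholders, since the real names are not given in the question, and the `build_trigger_commands` helper is made up for illustration.

```python
import shlex


def build_trigger_commands(dag_ids):
    """Return one `airflow trigger_dag <dag_id>` argument list per DAG id."""
    return [["airflow", "trigger_dag", dag_id] for dag_id in dag_ids]


# Hypothetical DAG ids; the real names are not given in the question.
dag_ids = ["example_dag_{}".format(i) for i in range(120)]
commands = build_trigger_commands(dag_ids)

if __name__ == "__main__":
    for argv in commands:
        # Print the command instead of running it, so the sketch is harmless.
        print(" ".join(shlex.quote(arg) for arg in argv))
        # To actually fire the triggers: subprocess.run(argv, check=True)
```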

After it finishes killing whatever processes it is killing, it starts executing all of the tasks properly. I don't even know what these processes were, since I can't see them after they are killed.

Has anyone encountered this kind of behavior? Any idea why it happens?

asked Aug 29 '19 15:08 by GuD



2 Answers

In my case, the cause was a DAG file that created a very large number of DAGs dynamically.

The "dagbag_import_timeout" config variable, which controls "How long before timing out a python file import while filling the DagBag", was set to its default value of 30 seconds, so the process filling the DagBag kept timing out.
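
One way to check whether a DAG file is close to that limit is to time its import outside the scheduler. The following is a minimal sketch, not part of Airflow: the helper names are made up for illustration, and the 30-second constant simply mirrors the default mentioned above rather than being read from airflow.cfg.

```python
import importlib.util
import time

# Mirrors the default named above; the real value lives in airflow.cfg.
DAGBAG_IMPORT_TIMEOUT = 30.0


def time_module_import(path):
    """Import the Python file at `path` once and return the elapsed seconds."""
    spec = importlib.util.spec_from_file_location("dag_file_under_test", path)
    module = importlib.util.module_from_spec(spec)
    start = time.perf_counter()
    spec.loader.exec_module(module)  # runs the file's top-level code
    return time.perf_counter() - start


def exceeds_timeout(elapsed, timeout=DAGBAG_IMPORT_TIMEOUT):
    """Return True if an import taking `elapsed` seconds would hit the timeout."""
    return elapsed >= timeout
```

For example, `exceeds_timeout(time_module_import("/path/to/dags/my_dynamic_dags.py"))` (a hypothetical path) indicates whether that file alone would time out. If it does, the limit can be raised through `dagbag_import_timeout` in the `[core]` section of airflow.cfg, or through the `AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT` environment variable.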

answered Sep 30 '22 09:09 by GuD


I've had a very similar issue. My DAG was of the same nature (a file that generates many DAGs dynamically). I tried the suggested solution, but it didn't work: I already had this value set fairly high, at 60 seconds, and increasing it to 120 did not resolve my issue.

Posting what worked for me in case someone else has a similar issue.

I came across this JIRA ticket: https://issues.apache.org/jira/browse/AIRFLOW-5506

which helped me resolve my issue: I disabled the SLA configuration, and then all my tasks started to run!
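
How the SLA was disabled isn't spelled out here. One possible reading, and this is an assumption, is that no task sets the `sla` argument any more. A sketch under that assumption, with hypothetical `default_args` and a made-up helper name:

```python
from datetime import timedelta


def without_sla(default_args):
    """Return a copy of `default_args` with any `sla` entry removed."""
    return {key: value for key, value in default_args.items() if key != "sla"}


# Hypothetical default_args; the real ones are not shown in this answer.
default_args = {
    "owner": "airflow",
    "retries": 1,
    "sla": timedelta(hours=1),
}

# Pass this copy to the DAG instead of the original dict.
dag_default_args = without_sla(default_args)
```

The original dict is left untouched, so the SLA can be restored later by passing `default_args` again.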

There can also be other solutions, as other comments in this ticket suggest.

For the record, my issue started after I enabled a lot of such DAGs (around 60?) that had been disabled for a few months. To be honest, I'm not sure how the SLA affects this from a technical perspective, but it did.

answered Sep 30 '22 08:09 by babis21