
Airflow scheduler is slow to schedule subsequent tasks


When I try to run a DAG in Airflow 1.8.0, I find that there is a long delay between the completion of a predecessor task and the time at which the successor task is picked up for execution (usually greater than the execution times of the individual tasks). The same happens with the Sequential, Local and Celery executors. Is there a way to reduce this overhead, e.g. any parameters in airflow.cfg that can speed up DAG execution? A Gantt chart has been added for reference.

asked Nov 23 '17 by Prasann


People also ask

How often does Airflow check for new DAGs?

Airflow scans the dags_folder for new DAGs every dag_dir_list_interval , which defaults to 5 minutes but can be modified. You might have to wait until this interval has passed before a new DAG appears in the UI.
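As a sketch, the relevant setting lives in airflow.cfg (the value shown is the documented default; tune it to your needs):

```ini
# airflow.cfg -- how often the scheduler scans dags_folder for new DAG files
[scheduler]
dag_dir_list_interval = 300  # seconds (5 minutes, the default)
```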

Is Start_date mandatory in Airflow DAG?

When creating a new DAG, you probably want to set a global start_date for your tasks. This can be done by declaring your start_date directly in the DAG() object. The first DagRun to be created will be based on the min(start_date) for all your tasks.
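That rule can be sketched in plain Python (this is not actual Airflow code; the task names and dates are made up for illustration):

```python
from datetime import datetime

# Hypothetical per-task start dates within one DAG
task_start_dates = {
    "extract": datetime(2017, 11, 1),
    "load": datetime(2017, 11, 5),
}

# The first DagRun is based on min(start_date) across all tasks
first_run_basis = min(task_start_dates.values())
print(first_run_basis.date())  # 2017-11-01
```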

How many tasks can an Airflow worker handle?

concurrency: This is the maximum number of task instances allowed to run concurrently across all active DAG runs for a given DAG. This allows you to set 1 DAG to be able to run 32 tasks at once, while another DAG might only be able to run 16 tasks at once.
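The global default is set in airflow.cfg, and individual DAGs can override it via the DAG's concurrency argument. A sketch (the value shown is the usual default):

```ini
# airflow.cfg -- default cap on concurrently running task instances per DAG
[core]
dag_concurrency = 16
```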

How do I keep my Airflow scheduler running?

The scheduler should be running all the time. You should just run airflow scheduler without a num_runs param. The scheduler is designed to be a long running process, an infinite loop. It orchestrates the work that is being done, it is the heart of airflow.
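One common way to keep it running is a systemd unit that restarts the scheduler if it ever exits. A sketch, assuming the paths and service user for your install (these are assumptions, not part of the original answer):

```ini
# /etc/systemd/system/airflow-scheduler.service (example paths)
[Unit]
Description=Airflow scheduler
After=network.target

[Service]
User=airflow
ExecStart=/usr/local/bin/airflow scheduler
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```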


2 Answers

As Nick said, Airflow is not a real-time tool. Tasks are scheduled and executed ASAP, but the next Task will never run immediately after the last one.

When you have more than ~100 DAGs with ~3 tasks each, or DAGs with many tasks (~100 or more), you have to consider three things:

  1. Increase the number of threads that the DagFileProcessorManager will use to load and schedule the DAGs (airflow.cfg):

```
[scheduler]
max_threads = 2
```

The max_threads setting controls how many DAGs are picked up and processed in parallel by the scheduler.

Increasing this configuration may reduce the time between the Tasks.

  2. Monitor your Airflow database to check for bottlenecks; Airflow uses it to manage and execute its processes:

Recently we were suffering from the same problem. The time between tasks was ~10-15 minutes; we were using PostgreSQL on AWS.

The instance was not using the resources very well; ~20 IOPS, 20% of the memory and ~10% of CPU, but Airflow was very slow.

After looking at the database performance using PgHero, we discovered that even a query using an Index on a small table was spending more than one second.

So we moved to a larger database instance, and Airflow is now running as fast as a rocket. :)

  3. To see how much time Airflow spends loading DAGs, run:

```
airflow list_dags -r

DagBag parsing time: 7.9497220000000075
```

If the DagBag parsing time is higher than ~5 minutes, it could be an issue.

All of this helped us run Airflow faster. I really advise you to upgrade to version 1.9, as many performance issues were fixed in that version.

BTW, we are using the Airflow master in production, with LocalExecutor and PostgreSQL as the metadata database.

answered Sep 27 '22 by Marcos Bernardelli


Your Gantt chart shows delays on the order of seconds. Airflow is not meant to be a real-time scheduling engine; it deals with things on the order of minutes. If you need things to run faster, you may consider a different scheduling tool than Airflow. Alternatively, you can put all of the work into a single task so you do not suffer from the delays of the scheduler.
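Collapsing the work into one task can be sketched like this (plain Python with hypothetical step names; in Airflow you would register run_all as the callable of a single task, e.g. a PythonOperator):

```python
# Hypothetical pipeline steps that were previously three separate tasks
def extract():
    return [1, 2, 3]

def transform(rows):
    return [r * 2 for r in rows]

def load(rows):
    # e.g. write to a warehouse; here we just return a summary
    return sum(rows)

def run_all():
    # One task body: no scheduler hand-off between steps,
    # so no per-task scheduling delay
    return load(transform(extract()))

print(run_all())  # 12
```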

answered Sep 28 '22 by Nick