
How many tasks can be scheduled in a single Airflow DAG?

Tags:

airflow

I am completely new to Airflow, and I couldn't find anywhere how many tasks can be scheduled in a single Airflow DAG, or what the maximum size of each task can be.

I want to schedule a task that should be able to handle millions of queries, identify each query's type, and schedule the next task according to that type.

I read the complete documentation but couldn't find this.

SJxD asked Jun 07 '18 09:06


People also ask

How many tasks can Airflow handle?

You can also tune your worker_concurrency (environment variable: AIRFLOW__CELERY__WORKER_CONCURRENCY), which determines how many tasks each Celery worker can run at any given time. By default, the Celery executor runs a maximum of sixteen tasks concurrently.
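
For reference, a hedged sketch of where that setting lives in airflow.cfg (16 is the default mentioned above; adjust to your workers' capacity):

[celery]
# Max number of task instances a single Celery worker will run at any given time
worker_concurrency = 16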

How many DAGs can Airflow handle?

You can have as many DAGs as you want, each describing an arbitrary number of tasks. In general, each one should correspond to a single logical workflow. When searching for DAGs, Airflow will only consider files where the strings “airflow” and “DAG” both appear in the contents of the .py file.

How many tasks can run in parallel Airflow?

Apache Airflow's ability to run parallel tasks, backed by Kubernetes and the CeleryExecutor, can save you a lot of time. You can use it to execute even 1000 parallel tasks in only 5 minutes.

Can we schedule a task in Airflow?

You can have the Airflow Scheduler be responsible for starting the process that turns the Python files contained in the DAGs folder into DAG objects that contain tasks to be scheduled.


1 Answer

There is no limit to how many tasks can be part of a single DAG.

Through the Airflow config, you can set concurrency limits for execution, such as the maximum number of parallel task instances overall, the maximum number of concurrent DAG runs for a given DAG, and so on. There are settings at the Airflow level, the DAG level, and the operator level, giving you coarse- to fine-grained control.

Here are the high-level concurrency settings you can tweak:

# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32

# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16

# Are DAGs paused by default at creation
dags_are_paused_at_creation = True

# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128

# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16

Reference: default_airflow.cfg
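
Beyond those installation-wide settings, you can also pin limits on an individual DAG or task directly in the DAG file. A minimal sketch, assuming Airflow 1.x parameter names (the DAG id, dates, and task are made up for illustration):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="example_limits",          # hypothetical DAG id
    start_date=datetime(2018, 6, 1),
    schedule_interval="@daily",
    concurrency=8,                    # max task instances running at once for this DAG
    max_active_runs=2,                # max concurrent runs of this DAG
)

limited_task = DummyOperator(
    task_id="limited_task",
    task_concurrency=4,               # max concurrent instances of this task across runs
    dag=dag,
)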

The parallelism settings are described in more detail in this answer. As for the maximum "size" of each task, I'm assuming you're referring to resource allocation, such as memory or CPU. This is configurable by the user, depending on which executor you choose:

  • With the LocalExecutor, for instance, a task will use whatever resources are available on the host.
  • With the MesosExecutor, on the other hand, you can define the maximum amount of CPU and/or memory that will be allocated to a task instance, and the DockerOperator also lets you cap the CPU and memory a given task instance will use (see the sketch just after this list).
  • With the CeleryExecutor, you can set worker_concurrency to define how many task instances each worker will take on.
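
As a concrete illustration of the DockerOperator point above, here is a hedged sketch that caps CPU and memory for a single task instance (parameter names as in Airflow 1.10's DockerOperator; the image and command are placeholders, and dag refers to a DAG object defined as in the earlier sketch):

from airflow.operators.docker_operator import DockerOperator

resource_limited = DockerOperator(
    task_id="resource_limited_task",
    image="python:3.6-slim",          # placeholder image
    command="echo ok",                # placeholder command
    cpus=1.0,                         # CPU share allotted to the container
    mem_limit="512m",                 # memory cap for the container
    dag=dag,
)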

Another way to restrict execution is the Pools feature (example). For instance, you can set the size of a pool of tasks talking to a database to 5 to prevent more than 5 tasks from hitting it at once (and potentially overloading the database/API/whatever resource you want to pool against).
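
A hedged sketch of assigning a task to such a pool, assuming a pool named database_pool with 5 slots has already been created (Admin -> Pools in the UI, or the airflow pool CLI) and that run_query is your own callable:

from airflow.operators.python_operator import PythonOperator

query_task = PythonOperator(
    task_id="run_query",
    python_callable=run_query,   # hypothetical callable that talks to the database
    pool="database_pool",        # the pool has 5 slots, so at most 5 such tasks run at once
    dag=dag,
)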

Taylor D. Edmiston answered Oct 18 '22 12:10