I am completely new to Airflow, and I couldn't find anywhere how many tasks can be scheduled in a single Airflow DAG, or what the maximum size of each task can be.
I want to schedule a task that can handle millions of queries, identify the type of each query, and schedule the next task according to that type.
I have read the complete documentation but couldn't find this.
You can also tune worker_concurrency (environment variable: AIRFLOW__CELERY__WORKER_CONCURRENCY), which determines how many tasks each Celery worker can run at any given time. By default, the Celery executor runs a maximum of sixteen tasks concurrently per worker.
You can have as many DAGs as you want, each describing an arbitrary number of tasks. In general, each one should correspond to a single logical workflow. When searching for DAGs, Airflow will only consider files where the strings "airflow" and "DAG" both appear in the contents of the .py file.
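For example, a minimal DAG file that satisfies this discovery heuristic might look like the sketch below (the file name, dag_id, and task are made up for illustration, and the import paths follow the Airflow 1.x layout):

# dags/hello_dag.py -- both "airflow" and "DAG" appear in this file,
# so the scheduler will pick it up for parsing
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="hello_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
)

say_hello = BashOperator(task_id="say_hello", bash_command="echo hello", dag=dag)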
Apache Airflow's ability to run tasks in parallel, backed by executors such as the KubernetesExecutor or CeleryExecutor, can save you a lot of time; you can use it to execute even 1000 parallel tasks in only 5 minutes.
The Airflow scheduler is responsible for parsing the Python files in the DAGs folder into DAG objects containing the tasks to be scheduled.
There are no limits to how many tasks can be part of a single DAG.
Through the Airflow config, you can set concurrency limits for execution, such as the maximum number of parallel tasks overall, the maximum number of concurrent DAG runs for a given DAG, and so on. There are settings at the Airflow level, the DAG level, and the operator level, giving you coarse- to fine-grained control.
Here are the high-level concurrency settings you can tweak:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
Reference: default_airflow.cfg
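To illustrate the DAG-level and operator-level knobs mentioned above, here is a minimal sketch. It uses the Airflow 1.x-style argument names that match the config excerpt (newer Airflow versions rename some of these, e.g. concurrency becomes max_active_tasks); the DAG and task are hypothetical:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="concurrency_limits_example",  # hypothetical DAG, for illustration only
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    concurrency=8,       # DAG level: at most 8 task instances of this DAG run at once
    max_active_runs=2,   # DAG level: at most 2 runs of this DAG can be active at once
)

throttled = BashOperator(
    task_id="throttled_task",
    bash_command="echo work",
    task_concurrency=2,  # operator level: at most 2 concurrent instances of this task
    dag=dag,
)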
The parallelism settings are described in more detail in this answer. As for the maximum "size" of each task, I'm assuming you're referring to resource allocation, such as memory or CPU. This is user configurable, depending on which executor you choose to use:
With LocalExecutor, for instance, a task will use whatever resources are available on the host.
With MesosExecutor, on the other hand, one can define the maximum amount of CPU and/or memory that will be allocated to a task instance, and through DockerOperator you also have the option to define the maximum amount of CPU and memory a given task instance will use.
With CeleryExecutor, you can set worker_concurrency to define the number of task instances each worker will take.
Another way to restrict execution is to use the Pools feature (example): for instance, you can set the max size of a pool of tasks talking to a database to 5 to prevent more than 5 tasks from hitting it at once (and potentially overloading the database/API/whatever resource you want to pool against).
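As a sketch of the Pools approach: assuming a pool named database_pool with 5 slots has already been created (for example in the UI under Admin -> Pools), tasks assigned to it will never run more than 5 at a time no matter how many are queued. The DAG and callable below are made up for illustration, again using Airflow 1.x import paths:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="pooled_queries_example",  # hypothetical DAG, for illustration only
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
)

def run_query():
    # Placeholder for the real database/API call.
    pass

# Twenty tasks share the pool, but at most 5 run concurrently
# because the pool only has 5 slots.
for i in range(20):
    PythonOperator(
        task_id="query_{}".format(i),
        python_callable=run_query,
        pool="database_pool",  # assumes this pool exists with 5 slots
        dag=dag,
    )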