
Run parallel tasks in Apache Airflow

Tags:

airflow

I am able to configure the airflow.cfg file to run tasks one after the other.

What I want to do is execute tasks in parallel, e.g. 2 at a time, until I reach the end of the list.

How can I configure this?

asked May 04 '18 by Mit


People also ask

Can Airflow run parallel tasks?

With the release of Airflow 2.3, users can write DAGs that dynamically generate parallel tasks at runtime. This feature, known as dynamic task mapping, is a paradigm shift for DAG design in Airflow.
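For example, here is a minimal sketch of dynamic task mapping using the TaskFlow API (the DAG id, function names, and input values are illustrative):

from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False)
def mapped_example():
    @task
    def add_one(x):
        return x + 1

    @task
    def total(values):
        return sum(values)

    # expand() generates one task instance per input value at runtime,
    # and those instances can run in parallel
    total(add_one.expand(x=[1, 2, 3]))


mapped_example()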

How many tasks can run in parallel Airflow?

Parallelism: This is the maximum number of tasks that can run at the same time in a single Airflow environment. If this setting is set to 32, for example, no more than 32 tasks can run concurrently across all DAGs.

What is parallelism in Airflow?

parallelism: This is the maximum number of tasks that can run concurrently per scheduler within a single Airflow environment. For example, if this setting is set to 32 and there are two schedulers, then no more than 64 tasks can be in the running or queued states at once across all DAGs.

How do I set dependencies between tasks in Airflow?

Trigger rules: when you set dependencies between tasks, Airflow's default behavior is to run a task only when all upstream tasks have succeeded. However, you can change this default behavior using trigger rules. The options available include all_success (the default), under which the task runs only when all upstream tasks have succeeded.
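For example, a minimal sketch of a trigger rule (the DAG and task ids are illustrative; DummyOperator is the pre-2.2 name of EmptyOperator):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG("trigger_rule_example", start_date=datetime(2018, 5, 1), schedule_interval=None) as dag:
    a = DummyOperator(task_id="a")
    b = DummyOperator(task_id="b")
    # "join" runs once all upstream tasks have finished, succeeded or failed
    join = DummyOperator(task_id="join", trigger_rule="all_done")
    [a, b] >> join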


1 Answer

Executing tasks in parallel in Airflow depends on which executor you're using, e.g., SequentialExecutor, LocalExecutor, CeleryExecutor, etc. Note that the default SequentialExecutor runs tasks strictly one at a time, which is why you're seeing them run one after the other.

For a simple setup, you can achieve parallelism by just setting your executor to LocalExecutor in your airflow.cfg:

[core]
executor = LocalExecutor

Reference: https://github.com/apache/incubator-airflow/blob/29ae02a070132543ac92706d74d9a5dc676053d9/airflow/config_templates/default_airflow.cfg#L76

This will spin up a separate process for each task.

(Of course you'll need to have a DAG with at least 2 tasks that can execute in parallel to see it work.)
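For example, a minimal sketch of such a DAG (the DAG id, task ids, and commands are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG("parallel_example", start_date=datetime(2018, 5, 1), schedule_interval=None) as dag:
    # no dependency between these two tasks, so LocalExecutor
    # can run them at the same time
    t1 = BashOperator(task_id="sleep_a", bash_command="sleep 30")
    t2 = BashOperator(task_id="sleep_b", bash_command="sleep 30")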

Alternatively, with CeleryExecutor, you can spin up any number of workers by running the following command once per worker you want:

$ airflow worker

The tasks will go into a Celery queue and each Celery worker will pull off of the queue.

You might find the section Scaling out with Celery in the Airflow Configuration docs helpful.

https://airflow.apache.org/howto/executor/use-celery.html
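As a sketch, the relevant airflow.cfg changes might look like this (the Redis broker and Postgres result backend URLs here are assumptions, and the exact key names vary a bit across Airflow versions, so check the docs above for yours):

[core]
executor = CeleryExecutor

[celery]
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost/airflow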

For any executor, once you have things running you may want to tweak the core settings that control parallelism.

They're all found under [core]. These are the defaults:

# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32

# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16

# Are DAGs paused by default at creation
dags_are_paused_at_creation = True

# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128

# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16

Reference: https://github.com/apache/incubator-airflow/blob/29ae02a070132543ac92706d74d9a5dc676053d9/airflow/config_templates/default_airflow.cfg#L99
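For example, to run at most 2 tasks at a time, as in the question, you could set the following (a sketch: parallelism is the installation-wide cap, dag_concurrency the per-DAG cap):

[core]
parallelism = 2
dag_concurrency = 2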

answered Nov 09 '22 by Taylor D. Edmiston