Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Airflow parallelism

Tags:

airflow

the Local Executor spawns new processes while scheduling tasks. Is there a limit to the number of processes it creates. I needed to change it. I need to know what is the difference between scheduler's "max_threads" and "parallelism" in airflow.cfg ?

like image 842
sidd607 Avatar asked Jul 05 '16 10:07

sidd607


People also ask

What is parallelism in Airflow?

parallelism : This is the maximum number of tasks that can run concurrently per scheduler within a single Airflow environment. For example, if this setting is set to 32, and there are two schedulers, then no more than 64 tasks can be in the running or queued states at once across all DAGs.

How do I run a parallel Airflow task?

By default, Airflow uses SequentialExecutor which would execute task sequentially no matter what. So to allow Airflow to run tasks in Parallel you will need to create a database in Postges or MySQL and configure it in airflow. cfg ( sql_alchemy_conn param) and then change your executor to LocalExecutor in airflow.

How many tasks can you execute in parallel in Airflow?

Apache Airflow's capability to run parallel tasks, ensured by using Kubernetes and CeleryExecutor, allows you to save a lot of time. You can use it to execute even 1000 parallel tasks in only 5 minutes.

What the parameter parallelism does in the configuration file Airflow CFG?

parallelism is the max number of task instances that can run concurrently on airflow. This means that across all running DAGs, no more than 32 tasks will run at one time.


1 Answers

parallelism: not a very descriptive name. The description says it sets the maximum task instances for the airflow installation, which is a bit ambiguous — if I have two hosts running airflow workers, I'd have airflow installed on two hosts, so that should be two installations, but based on context 'per installation' here means 'per Airflow state database'. I'd name this max_active_tasks.

dag_concurrency: Despite the name based on the comment this is actually the task concurrency, and it's per worker. I'd name this max_active_tasks_for_worker (per_worker would suggest that it's a global setting for workers, but I think you can have workers with different values set for this).

max_active_runs_per_dag: This one's kinda alright, but since it seems to be just a default value for the matching DAG kwarg, it might be nice to reflect that in the name, something like default_max_active_runs_for_dags So let's move on to the DAG kwargs:

concurrency: Again, having a general name like this, coupled with the fact that concurrency is used for something different elsewhere makes this pretty confusing. I'd call this max_active_tasks.

max_active_runs: This one sounds alright to me.

source: https://issues.apache.org/jira/browse/AIRFLOW-57


max_threads gives the user some control over cpu usage. It specifies scheduler parallelism.

like image 170
Roger Avatar answered Sep 25 '22 11:09

Roger