
What's the difference between Airflow's 'parallelism' and 'dag_concurrency'?

Tags: python, airflow

I can't understand the difference between dag_concurrency and parallelism. The documentation and some of the related posts here seem to contradict my findings.

My understanding was that the parallelism parameter sets the maximum number of task instances that can run globally (across all DAGs) in Airflow, and that dag_concurrency sets the maximum number of task instances that can run for a single DAG.

So I set parallelism to 8 and dag_concurrency to 4 and ran a single DAG. It ran 8 task instances (TIs) at a time, but I expected it to run 4 at a time.

  1. How is that possible?

  2. Also, if it helps, I have set the pool size to 10 or so for these tasks. But that shouldn't matter, since "config" parameters take priority over the pool's, right?

SpaceyBot asked Apr 17 '19 08:04


1 Answer

The other answer is only partially correct:

dag_concurrency does not explicitly control tasks per worker. dag_concurrency is the number of tasks that can run simultaneously per dag_run. So if your DAG has a point where 10 tasks could be running simultaneously, but you want to limit the traffic to the workers, you would set dag_concurrency lower.

The queue and pool settings also affect the number of tasks per worker.

These settings become very important as you start to build large libraries of simultaneously running DAGs.

parallelism is the maximum number of tasks across all the workers and DAGs.
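
To make the interaction of these limits concrete, here is a minimal sketch in plain Python. It is a simplified model, not Airflow's scheduler code: the function `max_running_tasks` and its parameter names are hypothetical, and it assumes a single dag_run whose tasks all share one pool, so the tightest limit wins.

```python
def max_running_tasks(runnable, parallelism, dag_concurrency, pool_slots):
    """Upper bound on simultaneously running tasks in a simplified model.

    Assumes one dag_run and one shared pool; every limit applies at once,
    so the smallest value is the effective cap.
    """
    return min(runnable, parallelism, dag_concurrency, pool_slots)


# The example from this answer: 10 tasks could run at once,
# but dag_concurrency throttles them.
print(max_running_tasks(runnable=10, parallelism=8, dag_concurrency=4, pool_slots=10))  # -> 4
```

Under this model the setup from the question (parallelism 8, dag_concurrency 4, pool size 10) would cap at 4 running tasks, so observing 8 is not explained by these three limits alone.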

trejas answered Oct 21 '22 17:10