I can't understand the difference between dag_concurrency and parallelism. The documentation and some of the related posts here somehow contradict my findings.
My understanding was that the parallelism parameter sets the maximum number of task runs allowed globally (across all DAGs) in Airflow, and that dag_concurrency sets the maximum number of task runs allowed for a single DAG.
So I set parallelism to 8 and dag_concurrency to 4 and ran a single DAG. I found that it was running 8 task instances (TIs) at a time, but I was expecting it to run 4 at a time.
How is that possible?
Also, if it helps, I have set the pool size to 10 or so for these tasks. But that shouldn't matter, since the "config" parameters take priority over the pool's, right?
The other answer is only partially correct:
dag_concurrency does not explicitly control tasks per worker. dag_concurrency is the number of tasks running simultaneously per dag_run. So if your DAG has a point where 10 tasks could be running simultaneously, but you want to limit the traffic to the workers, you would set dag_concurrency lower.
The queue and pool settings also have an effect on the number of tasks per worker.
These settings are very important as you start to build large libraries of simultaneously running DAGs.
parallelism is the maximum number of tasks across all the workers and DAGs.
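To make the interaction concrete, here is a minimal sketch of how these limits would combine if the description above holds: the number of tasks running at once for one DAG run is capped by the smallest of the limits. This is a toy model, not Airflow's scheduler code; the function name `max_running_tasks` and its parameters are hypothetical and chosen only for illustration.

```python
def max_running_tasks(runnable, parallelism, dag_concurrency, pool_slots):
    """Toy model (not Airflow code): upper bound on tasks running at once
    for a single DAG run, assuming every limit applies independently."""
    # The tightest limit wins, so take the minimum of all of them.
    return min(runnable, parallelism, dag_concurrency, pool_slots)


# Settings from the question: parallelism=8, dag_concurrency=4, a pool of
# 10 slots, and (assumed here) 10 tasks that are ready to run together.
limit = max_running_tasks(
    runnable=10, parallelism=8, dag_concurrency=4, pool_slots=10
)
print(limit)
```

Under this model the question's settings give a cap of 4, which is what the asker expected rather than the 8 that was observed.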