How to increase tasks queued per second?

Tags:

airflow

airflow-scheduler

I am trying to diagnose an under-performing Airflow pipeline and am wondering what kind of performance I should expect from the Airflow scheduler, in terms such as "tasks scheduled per second".

I have few queued jobs, and many of my tasks finish in seconds, so I suspect the scheduler is the limiting component and that having many quick tasks is my own fault. Still, I would rather not rewrite my DAGs if it can be avoided.

What can I do to increase the rate at which the scheduler queues tasks?

Pipeline Details

Here is what my current airflow.cfg looks like.

I only have two DAGs running. One is scheduled every 5 minutes and the other is rarely triggered by the first. I am currently trying to backfill several years at this frequency, but may need to change my approach:

[Image: screenshot from the original post]

As for worker nodes: I currently have 4 fairly powerful servers running at less than 10% resource usage in disk, network, CPU, RAM, and swap. Toggling 3 of the workers off has no impact on my task throughput, and the server left on barely even registers the change in workload.


asked Feb 01 '18 16:02

7yl4r

1 Answer

There are a number of config values in your airflow.cfg that could be related to this.

Under [core]:

parallelism: Total number of task instances that can run at once across the whole Airflow installation.
dag_concurrency: Limit of task instances that can run concurrently per DAG; may need to bump it if you have many parallel tasks. Can be overridden when defining a DAG.
non_pooled_task_slot_count: Limit of tasks without a pool configured that can run at once.
max_active_runs_per_dag: The maximum number of active DAG runs per DAG. Relevant if you're triggering runs manually or there's a backlog of DAG runs scheduled with a short interval. Can be overridden when defining a DAG.
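To see which of these limits is the tightest in a given setup, the values can be read straight from the config file. The following is a minimal sketch using Python's standard `configparser`. The sample values and the helper names `core_limits` and `tightest_limit` are made up for illustration, and the "smallest number wins" rule is a simplification that assumes a single DAG whose tasks use no pool.

```python
import configparser

# Hypothetical excerpt of an airflow.cfg; the numbers are placeholders,
# not the values from the question.
SAMPLE_CFG = """
[core]
parallelism = 32
dag_concurrency = 16
non_pooled_task_slot_count = 128
max_active_runs_per_dag = 16
"""


def core_limits(cfg_text):
    """Return the [core] concurrency settings as a dict of ints."""
    # interpolation=None because a real airflow.cfg may contain literal '%'.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    keys = (
        "parallelism",
        "dag_concurrency",
        "non_pooled_task_slot_count",
        "max_active_runs_per_dag",
    )
    return {key: parser.getint("core", key) for key in keys}


def tightest_limit(limits):
    """Return (name, value) of the smallest task-level limit.

    max_active_runs_per_dag is skipped because it counts DAG runs,
    not task instances.
    """
    task_keys = ("parallelism", "dag_concurrency", "non_pooled_task_slot_count")
    name = min(task_keys, key=lambda key: limits[key])
    return name, limits[name]


limits = core_limits(SAMPLE_CFG)
print(limits)
print(tightest_limit(limits))
```

With these placeholder numbers the tightest task-level limit is `dag_concurrency`, so raising `parallelism` alone would not be expected to change anything.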

Under [scheduler]:

scheduler_heartbeat_sec: Defines how often the scheduler runs; try it out with lower values.
min_file_process_interval: Process each file at most once every N seconds. Set to 0 to never limit how often a file is processed.
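For trying out different values without editing the file each time, Airflow also reads settings from environment variables named `AIRFLOW__{SECTION}__{KEY}`, which take precedence over `airflow.cfg`. Below is a small sketch of building such names in Python. The helper `airflow_env_var` and the trial values are hypothetical and only meant as a starting point for experiments.

```python
import os


def airflow_env_var(section, key):
    """Build the AIRFLOW__{SECTION}__{KEY} name for a config option."""
    return "AIRFLOW__{}__{}".format(section.upper(), key.upper())


# Trial values only; these are placeholders, not recommendations.
overrides = {
    ("scheduler", "min_file_process_interval"): "0",
    ("core", "parallelism"): "64",
}

# Copy the current environment and layer the overrides on top of it.
env = dict(os.environ)
for (section, key), value in overrides.items():
    env[airflow_env_var(section, key)] = value

print(airflow_env_var("scheduler", "min_file_process_interval"))
# AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL
```

The resulting `env` mapping could then be passed to whatever process starts the scheduler, for example through the `env` argument of `subprocess.Popen`.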

Under [celery]:

celeryd_concurrency: Number of worker processes Celery will run on each worker node, so essentially the number of task instances a worker can take at once. Matching the number of CPUs is a popular starting point, but you can definitely go higher.

The last one applies only if you're using the CeleryExecutor, which I'd definitely recommend if you're looking to increase your task throughput.
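As a rough illustration of how the worker setting interacts with `parallelism`, the sketch below multiplies the number of worker nodes by `celeryd_concurrency` and caps the result at `parallelism`. The worker count of 4 comes from the question; the other numbers and the function names are assumptions, and the model ignores per-DAG and pool limits.

```python
def celery_slots(num_workers, celeryd_concurrency):
    """Total task slots offered by all Celery worker nodes together."""
    return num_workers * celeryd_concurrency


def effective_ceiling(num_workers, celeryd_concurrency, parallelism):
    """Rough upper bound on concurrently running task instances."""
    return min(celery_slots(num_workers, celeryd_concurrency), parallelism)


# 4 workers as in the question; 16 and 32 are placeholder settings.
print(effective_ceiling(4, 16, 32))  # 32: capped by parallelism
print(effective_ceiling(1, 16, 32))  # 16: capped by the single worker
```

If the number of tasks actually running stays well below both figures, as the idle workers in the question suggest, the limit is probably upstream of the workers and adding worker capacity is unlikely to help.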

answered Sep 29 '22 21:09

Daniel Huang
