I am trying to diagnose an under-performing airflow pipeline and am wondering what kind of performance I should expect out of the airflow scheduler in terms similar to "tasks scheduled per second".
I have few queued jobs and many of my tasks finish in seconds so I suspect the scheduler is the limiting component and it is my fault for having many quick tasks. Still, I would rather not rewrite my DAGs if it can be avoided.
What can I do to increase the rate at which the scheduler queues tasks?
Here is what my current airflow.cfg looks like.
I only have two dags running. One is scheduled every 5 min and the other is rarely triggered by the first. I am currently trying to backfill several years at this frequency, but may need to change my approach:
As for worker nodes: I currently have 4 fairly powerful servers running at less than 10% resource usage in disk, network, cpu, RAM, swap. Toggling 3 of the workers off has no impact on my task throughput and the server left on barely even registers the change in workload.
concurrency : This is the maximum number of task instances allowed to run concurrently across all active DAG runs for a given DAG. This allows you to set 1 DAG to be able to run 32 tasks at once, while another DAG might only be able to run 16 tasks at once.
queue is an attribute of BaseOperator, so any task can be assigned to any queue. The default queue for the environment is defined in the airflow. cfg 's celery -> default_queue . This defines the queue that tasks get assigned to when not specified, as well as which queue Airflow workers listen to when started.
DAGs. In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG is defined in a Python script, which represents the DAGs structure (tasks and their dependencies) as code.
Every DAG has its schedule, start_date is simply the date a DAG should be included in the eyes of the Airflow scheduler. It also helps the developers to release a DAG before its production date. You could set up start_date more dynamically before Airflow 1.8.
There are a number of config values in your airflow.cfg
that could be related to this.
Under [core]
:
Under [scheduler]
:
Under [worker]
:
Last one is only if you're using the CeleryExecutor
, which I'd definitely recommend if you're looking to increase your task throughput.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With