Has anyone reported how much they've been able to get Airflow to scale at their company? I'm looking at implementing Airflow to execute 5,000+ tasks that will each run hourly, and someday scale that up to 20,000+ tasks. Looking at the scheduler, it seems like it might be a bottleneck since only one instance of it can run, and I'm concerned that with that many tasks the scheduler will struggle to keep up. Should I be?
Overview. One of Apache Airflow's biggest strengths is its ability to scale with good supporting infrastructure. To make the most of Airflow, there are a few key settings that you should consider modifying as you scale up your data pipelines.
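The settings that matter most are the global and per-DAG concurrency limits. As a rough illustration (the key names below are the Airflow 2.x ones; older releases use, for example, core.dag_concurrency instead of core.max_active_tasks_per_dag), you can print the values your deployment is currently running with:

```python
# Print the concurrency settings that most affect how far a single Airflow
# deployment can scale. Key names are the Airflow 2.x ones; older releases
# use e.g. core.dag_concurrency instead of core.max_active_tasks_per_dag.
from airflow.configuration import conf

SETTINGS = [
    ("core", "parallelism"),               # max task instances running concurrently
    ("core", "max_active_tasks_per_dag"),  # concurrent tasks allowed within one DAG
    ("core", "max_active_runs_per_dag"),   # concurrent runs allowed per DAG
    ("scheduler", "parsing_processes"),    # processes used to parse DAG files
    ("celery", "worker_concurrency"),      # task slots per Celery worker
]

for section, key in SETTINGS:
    print(f"{section}.{key} = {conf.get(section, key, fallback='<not set>')}")
```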
Apache Airflow's ability to run tasks in parallel, provided by executors such as the CeleryExecutor or the KubernetesExecutor, is what saves you time at this scale: with enough workers it can execute even 1,000 parallel tasks in a matter of minutes.
Airflow ships with several executors, but the most widely used for scaling out is the CeleryExecutor, which distributes the workload across multiple Celery workers that can run on different machines. The scheduler hands tasks to those workers through a message broker (typically Redis or RabbitMQ).
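One common way to spread load across worker pools with the CeleryExecutor is to route tasks to named queues and start dedicated workers for each queue. The DAG below is a minimal sketch (Airflow 2.x imports; the task ids, queue names, and commands are made up), not a drop-in recommendation:

```python
# Minimal sketch (Airflow 2.x imports; task ids, queue names, and commands are
# made up): route tasks to named Celery queues so separate worker pools can
# pick them up, e.g. `airflow celery worker -q heavy` on the big machines.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="queue_routing_example",
    schedule="@hourly",          # Airflow 2.4+; older versions use schedule_interval
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    light_etl = BashOperator(
        task_id="light_etl",
        bash_command="echo 'quick job'",
        queue="default",         # served by workers started with -q default
    )
    heavy_job = BashOperator(
        task_id="heavy_job",
        bash_command="echo 'long-running job'",
        queue="heavy",           # served by workers started with -q heavy
    )

    light_etl >> heavy_job
```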
Airflow is a platform that lets you build and run workflows. A workflow is represented as a DAG (a Directed Acyclic Graph), and contains individual pieces of work called Tasks, arranged with dependencies and data flows taken into account.
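As a minimal sketch of what that looks like in Python (Airflow 2.x TaskFlow API; the DAG and task names are illustrative), the dependencies and data flow come directly from how the tasks are wired together:

```python
# Minimal sketch of a DAG using the Airflow 2.x TaskFlow API (names are
# illustrative). Calling load(extract()) wires both the dependency and the
# data flow between the two tasks.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def example_workflow():
    @task
    def extract():
        return [1, 2, 3]

    @task
    def load(rows):
        print(f"loaded {len(rows)} rows")

    load(extract())


example_workflow()
```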
We run thousands of tasks a day at my company and have been using Airflow for the better part of two years. These DAGs run every 15 minutes and are generated from config files that can change at any time (fed in from a UI).
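To give a rough idea of that pattern (the directory, config shape, and operators below are made up for illustration, not our actual setup), a config-driven DAG generator looks roughly like this:

```python
# Rough illustration of config-driven DAG generation (the directory, config
# shape, and operators here are made up, not our actual setup). Airflow picks
# up any DAG object it finds in a module's global namespace.
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.bash import BashOperator

CONFIG_DIR = Path("/opt/airflow/dag_configs")  # hypothetical location

for config_file in CONFIG_DIR.glob("*.json"):
    cfg = json.loads(config_file.read_text())

    dag = DAG(
        dag_id=cfg["dag_id"],
        schedule="*/15 * * * *",   # every 15 minutes (Airflow 2.4+ parameter name)
        start_date=datetime(2024, 1, 1),
        catchup=False,
    )

    previous = None
    for step in cfg["steps"]:      # e.g. [{"name": "extract", "command": "..."}]
        op = BashOperator(task_id=step["name"], bash_command=step["command"], dag=dag)
        if previous is not None:
            previous >> op
        previous = op

    # Expose the DAG under a unique global name so the DagBag discovers it.
    globals()[cfg["dag_id"]] = dag
```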
The short answer - yes, it can definitely scale to that, depending on your infrastructure. Some of the new 1.10 features should make this easier than it is on the 1.8 version we run for all of those tasks. We ran this on a large Mesos/DC/OS cluster, and it took a good deal of fine-tuning to get to a stable point.
The long answer - although it can scale to that, we've found that a better solution is multiple Airflow instances with different configurations (scheduler settings, number of workers, etc.) optimized for the types of DAGs they are running. A set of DAGs that run long-running machine learning jobs should be hosted on a different Airflow instance from the ones running five-minute ETL jobs. This also makes it easier for different teams to maintain the jobs they are responsible for, and to iterate on any fine-tuning that's needed.