
How much can Airflow scale?

Has anyone reported how much they've been able to get Airflow to scale at their company? I'm looking at implementing Airflow to execute 5,000+ tasks that will each run hourly, and someday scaling that up to 20,000+ tasks. Examining the scheduler, it looks like it might be a bottleneck, since only one instance of it can run, and I'm concerned that with that many tasks the scheduler will struggle to keep up. Should I be?

chris.mclennon asked Aug 28 '18

People also ask

Is Airflow scalable?

One of Apache Airflow's biggest strengths is its ability to scale with good supporting infrastructure. To make the most of Airflow, there are a few key settings that you should consider modifying as you scale up your data pipelines.
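
For a sense of what that tuning looks like, here are a few of the settings most commonly adjusted as a deployment grows. The values below are placeholders, and the key names follow recent Airflow 2.x releases (some were renamed from their 1.10-era equivalents, e.g. dag_concurrency became max_active_tasks_per_dag):

    [core]
    # Total number of task instances allowed to run concurrently per scheduler
    parallelism = 64
    # Cap on concurrently running tasks within any single DAG
    max_active_tasks_per_dag = 16
    # Cap on concurrently active runs of any single DAG
    max_active_runs_per_dag = 16

    [scheduler]
    # Number of processes the scheduler uses to parse DAG files
    parsing_processes = 4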

How many tasks can run in parallel Airflow?

Apache Airflow can run tasks in parallel when backed by an executor such as the CeleryExecutor or the KubernetesExecutor, which can save a lot of time. With enough workers, it can execute as many as 1,000 parallel tasks in about 5 minutes.
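
Fan-out parallelism is simply a matter of declaring independent tasks in a DAG; the scheduler will run as many of them concurrently as the executor and the settings above allow. A minimal sketch (the DAG id, task count, and commands are invented for illustration; imports and parameters assume Airflow 2.4+):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="fan_out_example",          # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule="@hourly",
        catchup=False,
    ) as dag:
        # 100 tasks with no dependencies between them, so they are
        # eligible to run in parallel up to the configured limits.
        for i in range(100):
            BashOperator(
                task_id=f"task_{i}",
                bash_command=f"echo processing shard {i}",
            )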

Does Airflow use celery?

Airflow ships with several executors, but the most widely used is the CeleryExecutor, which scales out by distributing the workload across multiple Celery workers that can run on different machines. The CeleryExecutor dispatches tasks to those workers by passing messages through a broker.
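
Switching to it is mostly configuration: point Airflow at a message broker and a result backend, then start workers on each machine with the airflow celery worker command (Airflow 2.x CLI). A minimal airflow.cfg sketch, with placeholder connection URLs:

    [core]
    executor = CeleryExecutor

    [celery]
    # Broker that queues task messages for the workers (placeholder URL)
    broker_url = redis://redis-host:6379/0
    # Backend where workers record task state (placeholder URL)
    result_backend = db+postgresql://airflow:airflow@postgres-host/airflow
    # Number of task slots each worker process provides
    worker_concurrency = 16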

What is Airflow and how it works?

Airflow is a platform that lets you build and run workflows. A workflow is represented as a DAG (a Directed Acyclic Graph), and contains individual pieces of work called Tasks, arranged with dependencies and data flows taken into account.
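
In code, that structure is a DAG object containing tasks and explicit dependencies between them. A minimal sketch (task ids and commands are invented; Airflow 2.4+ assumed):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="etl_example",              # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        load = BashOperator(task_id="load", bash_command="echo load")

        # ">>" declares a dependency: extract must succeed before load starts
        extract >> load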


1 Answer

We run thousands of tasks a day at my company and have been using Airflow for the better part of 2 years. These DAGs run every 15 minutes and are generated from config files that can change at any time (fed in from a UI).

The short answer - yes, it can definitely scale to that, depending on your infrastructure. Some of the new 1.10 features should make this easier than it is on the 1.8 version we run, which handles all those tasks. We ran this on a large Mesos/DC/OS cluster, which took a good deal of fine-tuning to get to a stable point.

The long answer - although it can scale to that, we've found that a better solution is multiple Airflow instances with different configurations (scheduler settings, number of workers, etc.) optimized for the types of DAGs they run. A set of DAGs running long-running machine learning jobs should be hosted on a different Airflow instance from the ones running 5-minute ETL jobs. This also makes it easier for different teams to maintain the jobs they are responsible for, and easier to iterate on any fine-tuning that's needed.
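
One lightweight way to run several differently tuned instances from the same codebase is Airflow's environment-variable overrides, which follow the AIRFLOW__{SECTION}__{KEY} naming convention. A sketch of two hypothetical deployment profiles (the values are illustrative, not recommendations):

    # Instance A: long-running machine learning jobs - few, heavy tasks
    export AIRFLOW__CORE__PARALLELISM=8
    export AIRFLOW__CELERY__WORKER_CONCURRENCY=2

    # Instance B: short ETL jobs - many, light tasks
    export AIRFLOW__CORE__PARALLELISM=128
    export AIRFLOW__CELERY__WORKER_CONCURRENCY=32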

Viraj Parekh answered Sep 18 '22