
Airflow + celery or dask. For what, when?


I read in the official Airflow documentation the following:

[screenshot of the Airflow documentation on scaling out with Celery or Dask]

What does this mean exactly? What do the authors mean by scaling out? That is, when is it not enough to use Airflow or when would anyone use Airflow in combination with something like Celery? (same for dask)

asked Mar 15 '18 by Amelio Vazquez-Reina

People also ask

Does Airflow use DASK?

The DaskExecutor allows you to run Airflow tasks on a Dask Distributed cluster. Dask clusters can run on a single machine or across remote networks. For complete details, consult the Dask Distributed documentation. To enable it, edit your airflow.cfg to point the executor setting at DaskExecutor.
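
As a rough sketch (section and option names may vary between Airflow versions, and the scheduler address below is just an illustrative default), the relevant airflow.cfg entries look something like this:

```ini
[core]
executor = DaskExecutor

[dask]
# Address of the Dask Distributed scheduler (illustrative default)
cluster_address = 127.0.0.1:8786
```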

Does Airflow use Celery?

Airflow comes with various executors, but the most widely used among them is the Celery Executor, which scales out by distributing the workload across multiple Celery workers that can run on different machines. The CeleryExecutor distributes tasks to its pool of workers by sending them messages.

How do you use Celery executor Airflow?

CeleryExecutor is one of the ways you can scale out the number of workers. For this to work, you need to set up a Celery backend (RabbitMQ, Redis, …), change your airflow.cfg to point the executor parameter to CeleryExecutor, and provide the related Celery settings.
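
A minimal sketch of those airflow.cfg settings, assuming a Redis broker and a Postgres result backend (both connection strings are placeholders):

```ini
[core]
executor = CeleryExecutor

[celery]
# Placeholder connection strings; substitute your own broker and backend
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost/airflow
```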

How Apache airflow distributes jobs on Celery workers?

Airflow can distribute tasks across multiple workers by using a protocol to transfer jobs from the main application to the Celery workers; it relies on a message broker to carry those messages.
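
To make that producer/broker/worker flow concrete, here is a minimal plain-Celery sketch of the same pattern. This is not Airflow's internal code; the app name, broker URL, and task are all hypothetical:

```python
from celery import Celery

# Hypothetical app wired to a local Redis broker; RabbitMQ would work too.
app = Celery("sketch", broker="redis://localhost:6379/0")

@app.task
def run_task(task_id):
    # A worker process consumes this message from the broker and runs it.
    print(f"executing {task_id}")

# The "main application" side only publishes a message to the broker;
# any available worker picks it up and executes the task.
run_task.delay("example_task")
```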


1 Answer

In Airflow terminology, an "Executor" is the component responsible for running your tasks. The LocalExecutor does this by spawning processes on the machine Airflow runs on and letting each process execute a task.
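
Reduced to a sketch (this is not Airflow's actual implementation), the idea is ordinary local multiprocessing:

```python
from multiprocessing import Pool

def execute_task(task_id):
    # Stand-in for running one Airflow task
    return f"done: {task_id}"

if __name__ == "__main__":
    # Parallelism is capped by this one machine's resources
    with Pool(processes=4) as pool:
        print(pool.map(execute_task, ["t1", "t2", "t3"]))
```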

Naturally your capacity is then limited by the available resources on the local machine. The CeleryExecutor distributes the load to several machines. The executor itself publishes a request to execute a task to a queue, and one of several worker nodes picks up the request and executes it. You can now scale the cluster of worker nodes to increase overall capacity.
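Scaling out then amounts to starting more worker processes on additional machines, e.g. with the Airflow 2.x CLI (older 1.x releases used `airflow worker` instead):

```
airflow celery worker
```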

Finally, and not ready yet, there's a KubernetesExecutor in the works (link). This will run tasks on a Kubernetes cluster. Not only will this give your tasks complete isolation, since they run in containers; you can also leverage existing Kubernetes capabilities, such as autoscaling your cluster, so that you always have an optimal amount of resources available.

answered Oct 03 '22 by gogstad