How do I setup an Airflow of 2 servers?

Tags: airflow, apache-airflow

Trying to split out Airflow processes onto 2 servers. Server A, which has already been running in standalone mode with everything on it, has the DAGs, and I'd like to set it as the worker in the new setup with an additional server.

Server B is the new server which would host the metadata database on MySQL.

Can I have Server A run LocalExecutor, or would I have to use CeleryExecutor? The airflow scheduler would have to run on the server that has the DAGs, right? Or does it have to run on every server in a cluster? I'm confused as to what dependencies there are between the processes.

asked Jul 11 '17 22:07

simplycoding

2 Answers

This article does an excellent job demonstrating how to cluster Airflow onto multiple servers.

Multi-Node (Cluster) Airflow Setup

A more formal setup for Apache Airflow is to distribute the daemons across multiple machines as a cluster.

Benefits

Higher Availability

If one of the worker nodes were to go down or be purposely taken offline, the cluster would still be operational and tasks would still be executed.

Distributed Processing

If you have a workflow with several memory-intensive tasks, then the tasks will be better distributed to allow for higher utilization of data across the cluster and provide faster execution of the tasks.

Scaling Workers

Horizontally

You can scale the cluster horizontally and distribute the processing by adding more executor nodes to the cluster and allowing those new nodes to take load off the existing nodes. Since workers don’t need to register with any central authority to start processing tasks, machines can be turned on and off without any downtime to the cluster.

Vertically

You can scale the cluster vertically by increasing the number of celeryd daemons running on each node. This can be done by increasing the value in the ‘celeryd_concurrency’ config in the {AIRFLOW_HOME}/airflow.cfg file.

Example:

celeryd_concurrency = 30

You may need to increase the size of the instances in order to support a larger number of celeryd processes. This will depend on how memory- and CPU-intensive the tasks you’re running on the cluster are.
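As a rough way to compare the two scaling directions, the sketch below estimates the total number of concurrent task slots in a cluster. It assumes each worker node runs as many worker processes as its celeryd_concurrency value and that each process handles one task at a time. The function name and the node counts are illustrative and do not come from the article.

```python
def total_task_slots(concurrency_per_node):
    """Sum the worker processes across all nodes (assumed: one task slot per process)."""
    return sum(concurrency_per_node)


# Hypothetical cluster: two worker nodes, both with celeryd_concurrency = 30.
baseline = total_task_slots([30, 30])

# Horizontal scaling: add a third worker node with the same setting.
horizontal = total_task_slots([30, 30, 30])

# Vertical scaling: keep two nodes but raise celeryd_concurrency to 45 on each.
vertical = total_task_slots([45, 45])

print(baseline, horizontal, vertical)
```

Both changes give the same slot count in this toy case. Which one is practical depends on whether the existing instances have the memory and CPU headroom for the extra processes.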

Scaling Master Nodes

You can also add more Master Nodes to your cluster to scale out the services that are running on the Master Nodes. This will mainly allow you to scale out the Web Server Daemon in case there are too many HTTP requests coming in for one machine to handle, or if you want to provide Higher Availability for that service.

One thing to note is that there can only be one Scheduler instance running at a time. If you have multiple Schedulers running, there is a possibility that multiple instances of a single task will be scheduled. This could cause some major problems with your Workflow and cause duplicate data to show up in the final table if you were running some sort of ETL process.

If you would like, the Scheduler daemon may also be set up to run on its own dedicated Master Node.

Apache Airflow Cluster Setup Steps

Pre-Requisites

The following nodes are available with the given host names:
- master1 - Will have the role(s): Web Server, Scheduler
- master2 - Will have the role(s): Web Server
- worker1 - Will have the role(s): Worker
- worker2 - Will have the role(s): Worker
A queuing service is running (RabbitMQ, AWS SQS, etc.).
- You can install RabbitMQ by following these instructions: Installing RabbitMQ
- If you’re using RabbitMQ, it is recommended that it also be set up as a cluster for High Availability. Set up a Load Balancer to proxy requests to the RabbitMQ instances.
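The node layout above, together with the one-Scheduler rule, can be sketched as a small check. The host names and roles are taken from the list; the mapping of roles to commands assumes the Airflow 1.x CLI (airflow webserver, airflow scheduler, airflow worker), and newer releases name the worker command differently, so verify against your installed version. The helper names are hypothetical.

```python
# Node layout copied from the pre-requisites list.
NODE_ROLES = {
    "master1": ["Web Server", "Scheduler"],
    "master2": ["Web Server"],
    "worker1": ["Worker"],
    "worker2": ["Worker"],
}

# Assumption: Airflow 1.x CLI subcommands for each role.
ROLE_COMMANDS = {
    "Web Server": "airflow webserver",
    "Scheduler": "airflow scheduler",
    "Worker": "airflow worker",
}


def scheduler_nodes(node_roles):
    """Return the hosts that are assigned the Scheduler role."""
    return [host for host, roles in node_roles.items() if "Scheduler" in roles]


def commands_for(host, node_roles=NODE_ROLES):
    """Return the daemon commands to start on the given host."""
    return [ROLE_COMMANDS[role] for role in node_roles[host]]


# Enforce the rule that only one Scheduler instance runs at a time.
schedulers = scheduler_nodes(NODE_ROLES)
if len(schedulers) != 1:
    raise ValueError(f"expected exactly one scheduler node, found {schedulers}")

for host in NODE_ROLES:
    print(host, commands_for(host))
```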

Additional Documentation

Documentation: https://airflow.incubator.apache.org/
Install Documentation: https://airflow.incubator.apache.org/installation.html
GitHub Repo: https://github.com/apache/incubator-airflow

answered Oct 31 '22 18:10

Kyle Bridenstine

All airflow processes need to have the same contents in their airflow_home folder. This includes configuration and dags. If you only want Server B to run your MySQL database, you do not need to worry about any airflow specifics. Simply install the database on Server B, change the sql_alchemy_conn parameter in your airflow.cfg to point to that database, and run airflow initdb from Server A.
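A minimal sketch of that sql_alchemy_conn change, done in memory with Python's configparser. The host name server-b.example.com, the user, the password, and the database name are placeholders, and the [core] section and mysql:// scheme are assumptions based on older Airflow releases; check where your version keeps this setting and which MySQL driver you have installed.

```python
import configparser
import io
from urllib.parse import quote


def mysql_conn_url(user, password, host, database, port=3306):
    """Build a SQLAlchemy-style MySQL URL, percent-encoding the credentials."""
    return (
        f"mysql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# interpolation=None keeps '%' from the percent-encoded password literal.
cfg = configparser.ConfigParser(interpolation=None)

# Stand-in for Server A's existing airflow.cfg (hypothetical contents).
cfg.read_string("[core]\nsql_alchemy_conn = sqlite:////tmp/airflow.db\n")

# Point the metadata database at MySQL on Server B (placeholder values).
cfg["core"]["sql_alchemy_conn"] = mysql_conn_url(
    "airflow", "p@ss word", "server-b.example.com", "airflow"
)

# Render the updated config instead of overwriting a real file.
rendered = io.StringIO()
cfg.write(rendered)
print(rendered.getvalue())
```

After the real airflow.cfg on Server A carries the new value, airflow initdb is run from Server A as described above.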

If you also want to run airflow processes on server B, you would have to look into scaling using the CeleryExecutor.
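For orientation only, here is a sketch of the kind of airflow.cfg fragment a CeleryExecutor setup involves, generated with configparser. Every host name and credential is a placeholder, and the key names (broker_url, celery_result_backend) are assumptions based on older Airflow 1.x configs; later releases renamed some of these keys, so compare against the default config shipped with your version.

```python
import configparser
import io

cfg = configparser.ConfigParser(interpolation=None)

# Assumed [core] settings: switch the executor and keep the shared metadata DB.
cfg["core"] = {
    "executor": "CeleryExecutor",
    "sql_alchemy_conn": "mysql://airflow:CHANGE_ME@server-b.example.com:3306/airflow",
}

# Assumed [celery] settings: a RabbitMQ broker and a database result backend.
cfg["celery"] = {
    "broker_url": "amqp://guest:guest@rabbitmq.example.com:5672//",
    "celery_result_backend": "db+mysql://airflow:CHANGE_ME@server-b.example.com:3306/airflow",
}

# Render the fragment; the same contents would need to be present on every node.
buffer = io.StringIO()
cfg.write(buffer)
fragment = buffer.getvalue()
print(fragment)
```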

answered Oct 31 '22 18:10

Matthijs Brouns
