How do I deploy the Apache Airflow scheduler (Airflow was formerly known as Airbnb's Airflow) in a high-availability configuration? I am not asking about the backend DB or RabbitMQ, which should obviously be deployed in a high-availability configuration. My main focus is the scheduler: is there anything special that needs to be done?


Airflow setup for high availability

2 Answers

After a bit of digging I found that it is not safe to run multiple schedulers simultaneously. This means that, out of the box, Airflow schedulers are not safe to use in high-availability environments.

The Airflow team is planning to solve this issue by adding a locking mechanism on the DAG data structure, but this is not implemented yet (I checked by running 2 schedulers and saw that they scheduled the same DAG instances, which is not good). This is described here: https://groups.google.com/forum/#!topic/airbnb_airflow/-1wKa3OcwME

I did find a way to work around this high-availability issue by wrapping the schedulers with my own code and using cluster tools for leader election (I personally use Consul for this purpose). This way only the elected master runs the scheduler, and when the master goes down the slave replaces it.
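A minimal sketch of such a wrapper, assuming a Consul agent reachable on localhost and using Consul's session and KV lock HTTP endpoints. This is not the author's actual code: the key name, TTL, poll interval, and all function names are hypothetical, and error handling is kept deliberately thin.

```python
import json
import subprocess
import time
import urllib.request

CONSUL_ADDR = "http://127.0.0.1:8500"          # assumption: local Consul agent
LOCK_KEY = "service/airflow-scheduler/leader"  # hypothetical key name
SCHEDULER_CMD = ["airflow", "scheduler"]


def consul_put(path, payload=None):
    """PUT to the Consul HTTP API and return the decoded JSON response."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else b""
    req = urllib.request.Request(CONSUL_ADDR + path, data=data, method="PUT")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def create_session(ttl="15s"):
    """Create a TTL session; when it expires, Consul releases locks it holds."""
    body = {"Name": "airflow-scheduler", "TTL": ttl}
    return consul_put("/v1/session/create", body)["ID"]


def holds_lock(session_id):
    """Renew the session, then (re)acquire the key. True means we are leader."""
    consul_put("/v1/session/renew/" + session_id)
    return consul_put("/v1/kv/" + LOCK_KEY + "?acquire=" + session_id) is True


def reconcile(is_leader, proc, start, stop):
    """Make the local scheduler process match the current leadership state."""
    if proc is not None and proc.poll() is not None:
        proc = None  # the scheduler exited on its own
    if is_leader and proc is None:
        return start()
    if not is_leader and proc is not None:
        stop(proc)
        return None
    return proc


def stop_process(proc):
    """Terminate the scheduler and wait for it to exit."""
    proc.terminate()
    proc.wait()


def main(poll_seconds=5):
    session_id, proc = None, None
    while True:
        try:
            if session_id is None:
                session_id = create_session()
            leader = holds_lock(session_id)
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            session_id, leader = None, False
        proc = reconcile(leader, proc,
                         lambda: subprocess.Popen(SCHEDULER_CMD), stop_process)
        time.sleep(poll_seconds)
```

Run `main()` on every candidate host. A node that cannot reach Consul stops its own scheduler, which avoids two schedulers running at once during a network partition, at the cost of a short gap before another node takes over.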

Please consider this when you use Airflow in high-availability environments, since out of the box the Airflow scheduler is currently not suitable for this (unless you solve this issue yourself).

Edit - an alternative approach to the master/slave solution is to use a cluster manager/scheduler to make sure that exactly one Airflow scheduler instance is always running. This approach relies on the self-healing abilities of your cluster manager. For example, both Mesos and Nomad support this kind of configuration (I personally chose Nomad for its simplicity).
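As a rough illustration of the cluster-manager approach, the sketch below builds a Nomad service job in its JSON form with a single task group whose count is 1, so Nomad keeps one scheduler instance running and reschedules it if the node fails. The job ID, datacenter, driver choice, and binary path are all assumptions, not the author's configuration.

```python
import json


def scheduler_job(job_id="airflow-scheduler", datacenters=("dc1",),
                  airflow_bin="/usr/local/bin/airflow"):
    """Return a Nomad job definition (JSON form) for one Airflow scheduler.

    Every default here is a placeholder: adjust the datacenter, driver,
    and binary path to your own cluster.
    """
    return {
        "Job": {
            "ID": job_id,
            "Name": job_id,
            "Type": "service",       # long-running; Nomad restarts and reschedules it
            "Datacenters": list(datacenters),
            "TaskGroups": [{
                "Name": "scheduler",
                "Count": 1,          # the key point: never more than one instance
                "Tasks": [{
                    "Name": "scheduler",
                    "Driver": "exec",  # assumption: exec driver is available
                    "Config": {"command": airflow_bin, "args": ["scheduler"]},
                }],
            }],
        }
    }
```

`print(json.dumps(scheduler_job(), indent=2))` shows the payload, which could then be submitted to Nomad's job registration endpoint (`/v1/jobs`). During node failures Nomad may need some time to notice the lost allocation, so expect a short scheduling gap rather than seamless failover.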

answered Oct 19 '22 21:10

Ofer Eliassaf

My personal experience was to follow some best-practice instructions I found: restart the scheduler every 10 runs ( -N 10 ) and use this software when possible:

https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
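The restart-every-10-runs practice needs something to start the scheduler again after it exits. In production that is usually a process supervisor; the loop below is only a sketch of the idea in plain Python. The `-n 10` spelling of the run limit is my assumption, so check `airflow scheduler --help` for your version. `supervise` and its parameters are hypothetical names.

```python
import subprocess
import time

# Assumption: "-n 10" makes the scheduler exit after 10 scheduling runs.
SCHEDULER_CMD = ["airflow", "scheduler", "-n", "10"]


def supervise(cmd, max_launches=None, backoff_seconds=5.0,
              run=subprocess.call, sleep=time.sleep):
    """Launch cmd, wait for it to exit, then launch it again.

    Returns the list of exit codes once max_launches is reached;
    with max_launches=None it loops forever.
    """
    exit_codes = []
    while max_launches is None or len(exit_codes) < max_launches:
        exit_codes.append(run(cmd))  # blocks until the scheduler exits
        sleep(backoff_seconds)       # avoid a tight loop if it crashes at once
    return exit_codes
```

Calling `supervise(SCHEDULER_CMD)` keeps a fresh scheduler process running indefinitely. It does nothing about host failure, which is what the failover controller above is meant to cover.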

I also use a DAG which pings a monitoring system to be sure that the scheduler has not gone away.
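One way to read this is as a dead-man's switch: a small DAG runs every few minutes and posts a heartbeat, and the monitoring system alerts when heartbeats stop arriving. The sketch below assumes a hypothetical HTTP endpoint and payload format, and uses Airflow 2.x import paths; older or newer releases name some of these differently.

```python
import json
import time
import urllib.request
from datetime import datetime, timedelta

MONITOR_URL = "http://monitoring.example.internal/heartbeat"  # hypothetical


def http_post(url, payload):
    """POST a JSON payload and return the HTTP status code."""
    req = urllib.request.Request(
        url, data=payload,
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


def ping_monitor(url=MONITOR_URL, send=http_post, now=None):
    """Send one heartbeat; the payload fields are an invented example."""
    stamp = int(time.time() if now is None else now)
    body = {"source": "airflow-scheduler", "timestamp": stamp}
    return send(url, json.dumps(body).encode("utf-8"))


def _ping_task():
    """Zero-argument wrapper so the operator passes nothing unexpected."""
    return ping_monitor()


def build_dag():
    """Build the heartbeat DAG (imports are local so this file loads without Airflow)."""
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    dag = DAG(
        dag_id="scheduler_heartbeat",            # hypothetical name
        start_date=datetime(2022, 1, 1),
        schedule_interval=timedelta(minutes=5),  # assumption: 5-minute heartbeat
        catchup=False,
    )
    PythonOperator(task_id="ping", python_callable=_ping_task, dag=dag)
    return dag
```

In a real DAG file, assign the result at module level (`dag = build_dag()`) so Airflow discovers it. The alert threshold on the monitoring side should be a few intervals long, since a busy but healthy scheduler can run the task late.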

answered Oct 19 '22 22:10

ozw1z5rd

