What are the minimum hardware requirements for setting up an Apache Airflow cluster?
E.g. RAM, CPU, disk, etc. for the different types of nodes in the cluster.
To set up an Airflow cluster, we need to install the components and services below:

Airflow Webserver: a web interface to query the metadata database and to monitor and trigger DAGs.
Airflow Scheduler: checks the status of the DAGs and tasks in the metadata database, creates new task instances if necessary, and sends the tasks to the queues.
First developed at Airbnb, it is now under the Apache Software Foundation. Airflow uses Python to create workflows that can be easily scheduled and monitored. Airflow can run anything; it is completely agnostic to what you are running.
Airflow is Python-based, but you can execute a program in any language. For instance, the first stage of your workflow might run a C++ program to perform image analysis, and the next stage a Python program to push that information to S3.
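As a rough sketch of that kind of mixed-language pipeline (the DAG id, binary path, bucket, and keys below are invented for illustration, and `schedule_interval` may be spelled `schedule` on newer Airflow releases):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def upload_results_to_s3():
    """Push the C++ program's output to S3 (bucket and paths are placeholders)."""
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(
        "/tmp/image_analysis/results.json",
        "my-example-bucket",
        "image-analysis/results.json",
    )


with DAG(
    dag_id="image_analysis_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Stage 1: run the compiled C++ binary (path and flags are placeholders).
    run_cpp_analysis = BashOperator(
        task_id="run_cpp_analysis",
        bash_command="/opt/tools/image_analysis --input /data/images --output /tmp/image_analysis",
    )

    # Stage 2: move the results to S3 with a Python task.
    upload_to_s3 = PythonOperator(
        task_id="upload_to_s3",
        python_callable=upload_results_to_s3,
    )

    run_cpp_analysis >> upload_to_s3
```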
I have had no issues using very small instances in pseudo-distributed mode (32 parallel workers; Postgres backend).
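For reference, that sort of pseudo-distributed setup boils down to a few lines in airflow.cfg; the values and connection string below are illustrative rather than the poster's actual settings, and on newer Airflow releases sql_alchemy_conn lives under [database] instead of [core]:

```
[core]
executor = LocalExecutor
# 32 task slots across the whole installation
parallelism = 32
# Placeholder connection string for the Postgres metadata database
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@metadata-db:5432/airflow
```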
If you want distributed mode, you should be more than fine with that kind of setup as long as you keep it homogeneous. Airflow shouldn't really be doing the heavy lifting anyway; push the workload out to other systems (Spark, EMR, BigQuery, etc.).
You will also have to run some kind of message queue, like RabbitMQ; I think Redis is supported too. However, this doesn't dramatically impact how you size the cluster.
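For example, switching to the CeleryExecutor only adds a couple of broker settings in airflow.cfg; both broker URLs below are placeholders for whichever broker you actually run:

```
[core]
executor = CeleryExecutor

[celery]
# Use a Redis or RabbitMQ URL, depending on the broker you deploy (placeholders below)
broker_url = redis://redis-host:6379/0
# broker_url = amqp://airflow:airflow@rabbitmq-host:5672/
result_backend = db+postgresql://airflow:airflow@metadata-db:5432/airflow
```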
We are running Airflow in AWS with the config below:
t2.small --> Airflow scheduler and webserver
db.t2.small --> Postgres for the metadata store
The parallelism parameter in airflow.cfg is set to 10, and there are around 10 users who access the Airflow UI.
All we do from Airflow is SSH to other instances and run the code from there.
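A sketch of that "SSH out and run it there" pattern using the SSH provider's SSHOperator (requires the apache-airflow-providers-ssh package; the DAG id, connection id, and command below are made up for illustration):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="remote_job_via_ssh",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # "worker_box" is a placeholder Airflow connection pointing at the remote instance.
    run_remote_job = SSHOperator(
        task_id="run_remote_job",
        ssh_conn_id="worker_box",
        command="bash /opt/jobs/run_pipeline.sh --date {{ ds }}",
    )
```

This keeps the Airflow nodes themselves light, since the heavy work happens on the remote instances.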