 

Minimum hardware requirements for Apache Airflow cluster

What are the minimum hardware requirements for setting up an Apache Airflow cluster?

E.g. RAM, CPU, disk, etc. for the different types of nodes in the cluster.

asked Nov 14 '17 by Duleendra


People also ask

How do I set up an Airflow cluster?

To set up an Airflow cluster, we need to install the following components and services: Airflow webserver: a web interface to query the metadata database and to monitor and execute DAGs. Airflow scheduler: checks the status of DAGs and tasks in the metadata database, creates new task instances as necessary, and sends the tasks to the queues.
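As a rough sketch, what ties those services together is that they all read the same airflow.cfg. The executor and metadata-database settings below are illustrative only; the hostname and credentials are placeholders, and depending on the Airflow version the connection key may live under a [database] section instead:

```ini
[core]
# CeleryExecutor is the usual choice for a multi-node cluster;
# LocalExecutor is enough for a single machine.
executor = CeleryExecutor

# Shared metadata database that the webserver, scheduler, and
# workers all point at (placeholder host and credentials).
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@metadata-db:5432/airflow
```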

What does Apache Airflow run on?

First developed by Airbnb, it is now under the Apache Software Foundation. Airflow uses Python to create workflows that can be easily scheduled and monitored. Airflow can run anything; it is completely agnostic to what you are running.

Is Apache Airflow Python only?

Airflow is Python-based, but it can execute a program irrespective of the language. For instance, the first stage of your workflow might run a C++ program to perform image analysis, followed by a Python program that transfers the results to S3.
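Under the hood this is just process invocation: a BashOperator (or similar) shells out to whatever binary a task points at. A minimal standalone sketch of that pattern, using `echo` as a stand-in for the hypothetical C++ image-analysis binary:

```python
import subprocess

def run_stage(argv):
    # Run one pipeline stage as an external process, much as a
    # BashOperator task would; raises if the program exits non-zero.
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Stand-in for e.g. ["/opt/bin/analyze_images", "--input", "batch1"]
out = run_stage(["echo", "analysis complete"])
print(out)  # analysis complete
```

The language of the underlying program never matters to the orchestrator; only the exit code and output do.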


2 Answers

I have had no issues using very small instances in pseudo-distributed mode (32 parallel workers; Postgres backend):

  • RAM: 4096 MB
  • CPU: 1000 MHz
  • vCPUs: 2
  • Disk: 40 GB

If you want distributed mode, you should be more than fine with that spec as long as you keep the cluster homogeneous. Airflow shouldn't really do the heavy lifting anyway; push the workload out to other systems (Spark, EMR, BigQuery, etc.).

You will also have to run some kind of message queue, like RabbitMQ; Redis is supported as a Celery broker too. However, this doesn't dramatically affect how you size the cluster.
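For reference, the broker choice comes down to a single setting in the [celery] section of airflow.cfg; the hostnames below are placeholders, not a recommendation:

```ini
[celery]
# RabbitMQ as the Celery broker (placeholder host):
broker_url = amqp://airflow:airflow@rabbitmq-host:5672/
# ...or Redis instead:
# broker_url = redis://redis-host:6379/0

# Task state goes to the metadata DB either way (placeholder host):
result_backend = db+postgresql://airflow:airflow@metadata-db:5432/airflow
```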

answered Oct 21 '22 by jastang

We are running Airflow in AWS with the following config:

t2.small --> airflow scheduler and webserver

db.t2.small --> Postgres for the metastore

The parallelism parameter in airflow.cfg is set to 10, and around 10 users access the Airflow UI.
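That knob lives in the [core] section of airflow.cfg; a sketch of the relevant lines, using the value from this answer (note that some related setting names have changed across Airflow versions):

```ini
[core]
# Maximum number of task instances allowed to run concurrently
# across the whole cluster.
parallelism = 10
```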

All we do from Airflow is SSH to other instances and run the code there.
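That pattern (Airflow only orchestrating, remote hosts doing the work) keeps the scheduler node small. In a real DAG you would typically reach for the SSH provider's SSHOperator; as a standalone sketch, the host and script path below are hypothetical:

```python
import subprocess

def remote_command(host, command, user="airflow"):
    # Build argv for a non-interactive ssh call; BatchMode=yes makes
    # ssh fail fast instead of prompting for a password inside a task.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", command]

cmd = remote_command("worker-1.internal", "python /opt/jobs/etl.py")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # run it where SSH keys are in place
```

Because the heavy work happens on the remote instances, the Airflow box itself only needs enough resources to poll task state and hold idle SSH sessions.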

answered Oct 21 '22 by Sathish