Apache Airflow or Apache Beam for data processing and job scheduling

I'm trying to give useful information but I am far from being a data engineer.

I currently use the Python library pandas to run a long series of transformations on my data, which has a lot of inputs (currently CSV and Excel files). The outputs are several Excel files. I would like to run scheduled, monitored batch jobs once a month, with parallel computation (I mean not as sequential as what I'm doing with pandas).

I don't really know Beam or Airflow. I quickly read through the docs and it seems that both can achieve that. Which one should I use?


asked May 09 '18 09:05

LouisB

2 Answers

The other answers are quite technical and hard to understand. I was in your position before so I'll explain in simple terms.

Airflow can run almost anything. It has BashOperator and PythonOperator, which means it can run any bash script or any Python script.
It is a way to organize (set up complicated data pipeline DAGs), schedule, monitor, and trigger re-runs of data pipelines, in an easy-to-view, easy-to-use UI.
It is also easy to set up, and everything is in familiar Python code.
Doing pipelines in an organized manner (i.e. using Airflow) means you don't waste time debugging a mess of data processing (cron) scripts all over the place.
Nowadays (roughly from 2020 onwards), we call it an orchestration tool.

Apache Beam is a wrapper for the many data processing frameworks (Spark, Flink etc.) out there.
The intent is that you learn only Beam and can then run your pipelines on multiple backends (Beam runners).
If you are familiar with Keras and TensorFlow/Theano/Torch, the relationship between Keras and its backends is similar to the relationship between Beam and its data processing backends.
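
To make the "one pipeline, many backends" idea concrete, here is a minimal sketch in plain Python. It does not use Beam's API: `steps`, `run_local` and `run_threaded` are hypothetical names I made up, standing in for a pipeline definition and two interchangeable runners. The point is only that the pipeline is written once and the choice of backend is made separately.

```python
from concurrent.futures import ThreadPoolExecutor

# The "pipeline": an ordered list of per-element steps, defined once.
steps = [str.strip, str.upper]

def apply_steps(element):
    """Push a single element through every step of the pipeline."""
    for step in steps:
        element = step(element)
    return element

def run_local(elements):
    """Hypothetical runner 1: process elements one by one in this thread."""
    return [apply_steps(e) for e in elements]

def run_threaded(elements):
    """Hypothetical runner 2: same pipeline, executed on a thread pool."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(apply_steps, elements))  # map keeps input order

data = [" a ", "b ", " c"]
local_result = run_local(data)
threaded_result = run_threaded(data)
print(local_result, threaded_result)
```

Both runners give the same result because neither the steps nor the data know how they are being executed, which is roughly the property Beam aims for across its real runners.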

Google Cloud Platform's Cloud Dataflow is one backend for running Beam on.
They call it the Dataflow runner.

GCP's offering, Cloud Composer, is a managed Airflow implementation as a service, running in a Kubernetes cluster in Google Kubernetes Engine (GKE).

So your options are:

Run Airflow yourself and do the data processing on the instance itself. If your data is small, or your instance is powerful enough, you can process data on the machine running Airflow. This is why many people are confused about whether Airflow can process data or not.
Run Airflow yourself and have it call Beam jobs.
Use Cloud Composer (managed Airflow as a service) to call jobs in Cloud Dataflow.
Use Cloud Composer to run data processing containers in Composer's own Kubernetes cluster environment, using Airflow's KubernetesPodOperator (KPO).
Use Cloud Composer to run data processing containers in Composer's Kubernetes cluster environment with Airflow's KPO, but better isolated: create a new node pool and specify that the KPO pods run in it.

My personal experience:
Airflow is lightweight and not difficult to learn (easy to implement), so you should use it for your data pipelines whenever possible.
Also, since many companies are looking for experience with Airflow, you should probably learn it if you want to be a data engineer.
Finally, managed Airflow (I've only used GCP's Composer so far) is much more convenient than running Airflow yourself and managing the Airflow webserver and scheduler processes.

answered Jan 01 '23 08:01

cryanbhu

Apache Airflow and Apache Beam look quite similar on the surface. Both of them allow you to organise a set of steps that process your data and both ensure the steps run in the right order and have their dependencies satisfied. Both allow you to visualise the steps and dependencies as a directed acyclic graph (DAG) in a GUI.

But when you dig a bit deeper there are big differences in what they do and the programming models they support.

Airflow is a task management system. The nodes of the DAG are tasks, and Airflow runs them in the proper order: a task only starts once the tasks it depends on have finished. Dependent tasks never run at the same time, only one after another. Independent tasks can run concurrently.
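
As a rough illustration of that scheduling model (not of Airflow's own API), the sketch below uses the standard library's `graphlib` (Python 3.9+) to work out which tasks are ready at each point. The task names are hypothetical, loosely modelled on the question's CSV/Excel inputs.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dependencies = {
    "load_csv": set(),
    "load_excel": set(),
    "merge": {"load_csv", "load_excel"},
    "write_report": {"merge"},
}

def run_all(deps):
    """Walk the graph in dependency order and return the batches of ready tasks."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        batches.append(ready)               # tasks in one batch are independent
        for task in ready:
            sorter.done(task)               # finishing a task unblocks its dependents
    return batches

batches = run_all(dependencies)
print(batches)
```

The two load tasks land in the same batch, so a scheduler is free to run them concurrently, while `merge` and `write_report` each have to wait for the batch before them.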

Beam is a dataflow engine. The nodes of the DAG form a (possibly branching) pipeline. All the nodes in the DAG are active at the same time, and they pass data elements from one to the next, each doing some processing on them.
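
Chained Python generators behave in a loosely similar way, so they make a reasonable toy model of this. This is only an analogy, not Beam code: the stage names are hypothetical, and `log` exists purely to show that the stages interleave instead of running one after another.

```python
log = []  # records the order in which stages touch elements

def read(rows):
    """Source stage: emit elements one at a time."""
    for row in rows:
        log.append(f"read {row}")
        yield row

def double(rows):
    """Transform stage: processes each element as soon as it arrives."""
    for row in rows:
        log.append(f"double {row}")
        yield row * 2

def keep_large(rows, threshold):
    """Filter stage: pass on only elements at or above the threshold."""
    for row in rows:
        if row >= threshold:
            yield row

result = list(keep_large(double(read([1, 2, 3])), threshold=4))
print(result)
print(log)
```

The log shows element 1 being read and doubled before element 2 is even read: every stage is "live" at once and elements stream through, which is the opposite of finishing one task over the whole dataset before starting the next.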

The two have some overlapping use cases but there are a lot of things only one of the two can do well.

Airflow manages tasks that depend on one another. While this dependency can consist of one task passing data to the next one, that is not a requirement. In fact, Airflow doesn't even care what the tasks do; it just needs to start them and see whether they finished or failed. If tasks need to pass data to one another, you need to coordinate that yourself, telling each task where to read and write its data, e.g. a local file path or a web service somewhere. Tasks can consist of Python code, but they can also be any external program or a web service call.
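
Here is a small sketch of what "coordinate that yourself" can look like, using only the standard library. The two functions are hypothetical tasks; the only thing connecting them is a file path that both have been told about, which is the part a scheduler like Airflow leaves up to you.

```python
import csv
import tempfile
from pathlib import Path

def extract(out_path):
    """First task: write raw rows to the agreed location."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "amount"])
        writer.writerows([["a", 1], ["b", 2], ["c", 3]])

def summarize(in_path):
    """Second task: read what the first task wrote and return the total amount."""
    with open(in_path, newline="") as f:
        return sum(int(row["amount"]) for row in csv.DictReader(f))

with tempfile.TemporaryDirectory() as workdir:
    handoff = Path(workdir) / "raw.csv"  # the location both tasks agree on
    extract(handoff)                     # must finish before summarize starts
    total = summarize(handoff)

print(total)
```

In a real deployment the handoff location would more likely be shared storage than a temporary directory, since the two tasks are not guaranteed to run on the same machine.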

In Beam, your step definitions are tightly integrated with the engine. You define the steps in a supported programming language and they run inside a Beam process. Handling the computation in an external process would be difficult if possible at all*, and is certainly not the way Beam is supposed to be used. Your steps only need to worry about the computation they're performing, not about storing or transferring the data. Transferring the data between different steps is handled entirely by the framework.

In Airflow, if your tasks process data, a single task invocation typically does some transformation on the entire dataset. In Beam, data processing is part of the core interfaces, so it can't really do anything else. An invocation of a Beam step typically handles one or a few data elements, not the full dataset. Because of this, Beam also supports datasets of unbounded length, which is not something Airflow can natively cope with.
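
The reason per-element processing copes with unbounded input can be shown with a toy example in plain Python (again, not Beam's API). The step `to_celsius` is hypothetical and only ever looks at one reading at a time, so it works on a source that never ends, whereas a task that first needs the whole dataset in hand would never get to start.

```python
from itertools import count, islice

def to_celsius(readings):
    """Per-element step: converts one Fahrenheit reading at a time."""
    for fahrenheit in readings:
        yield round((fahrenheit - 32) * 5 / 9, 1)

# An endless source of made-up readings: 32, 41, 50, ...
endless = (32 + 9 * n for n in count())

# We can consume as much of the converted stream as we like.
first_three = list(islice(to_celsius(endless), 3))
print(first_three)
```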

Another difference is that Airflow is a framework by itself, but Beam is actually an abstraction layer. Beam pipelines can run on Apache Spark, Apache Flink, Google Cloud Dataflow and others. All of these support a more or less similar programming model. By the way, Google has also turned Airflow into a managed cloud service, Google Cloud Composer.

*Apache Spark's support for Python is actually implemented by running a full Python interpreter in a subprocess, but this is implemented at the framework level.

answered Jan 01 '23 07:01

JanKanis


Tags:

pandas

airflow

apache-beam
