Apache Airflow Continuous Integration Workflow and Dependency Management

Tags: python, airflow

I'm thinking of starting to use Apache Airflow for a project and am wondering how people manage continuous integration and dependencies with Airflow. More specifically, say I have the following setup:

Three Airflow servers: dev, staging, and production.

I have two Python DAGs whose source code I want to keep in separate repos. The DAGs themselves are simple, basically just using a PythonOperator to call main(*args, **kwargs) (a rough sketch follows the list below). However, the actual code that's run by main is very large and spans several files/modules. Each code base has different dependencies, for example:

DAG 1 uses Python 2.7, pandas==0.18.1, requests==2.13.0

DAG 2 uses Python 3.6, pandas==0.20.0, and numba==0.27, as well as some Cythonized code that needs to be compiled
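Roughly, each DAG file is just a thin wrapper like the following (package, module, and task names here are illustrative, not the real ones):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# The heavy lifting lives in its own installable package (name illustrative)
from dag1_project.entry import main

dag = DAG(
    dag_id="dag1",
    start_date=datetime(2017, 7, 1),
    schedule_interval="@daily",
)

run_main = PythonOperator(
    task_id="run_main",
    python_callable=main,
    op_args=[],                    # *args forwarded to main
    op_kwargs={"env": "dev"},      # **kwargs forwarded to main
    dag=dag,
)
```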

How do I manage Airflow running these two DAGs with completely different dependencies? Also, how do I manage the continuous integration of the code for both of these DAGs into each Airflow environment (dev, staging, prod)? Do I just get Jenkins or something to SSH to the Airflow server and run something like git pull origin BRANCH?

Hopefully this question isn't too vague and people can see the problems I'm having.

Asked Jul 10 '17 by Roger Thomas


People also ask

Is Airflow a CI/CD tool?

CI/CD frameworks enable all kinds of automated process steps to run in response to changes in the source code base. Since Airflow and all of its components are defined in source code, it is a fitting approach to build a robust development and deployment framework with CI/CD tools.

Is Airflow an ETL tool?

Airflow isn't an ETL tool per se. But it manages, structures, and organizes ETL pipelines using Directed Acyclic Graphs (DAGs). DAGs describe the relationships and dependencies between tasks, and make it possible to run a single branch multiple times or to skip branches from a sequence when necessary.

Is Airflow a workflow engine?

Apache Airflow is an open-source workflow management platform for data engineering pipelines.

What is workflow in Airflow?

Airflow is a platform that lets you build and run workflows. A workflow is represented as a DAG (a Directed Acyclic Graph), and contains individual pieces of work called Tasks, arranged with dependencies and data flows taken into account.
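For illustration only, a toy workflow with two dependent tasks might look like this (DAG, task names, and commands are made up):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="toy_workflow",
    start_date=datetime(2017, 7, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

# "load" only runs after "extract" has succeeded
extract >> load
```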




1 Answer

We use Docker to run the code with different dependencies, and the DockerOperator in the Airflow DAG, which can run Docker containers, including on remote machines (with a Docker daemon already running). We actually have only one Airflow server to run jobs, but several more machines with a Docker daemon running, which the Airflow executors call.
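As a rough sketch of that pattern (the image name, registry, and docker_url are placeholders for your own infrastructure), the DAG delegates all of the work to a container built from the repo's own image, so each DAG's dependencies live entirely inside its image:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.docker_operator import DockerOperator

dag = DAG(
    dag_id="dag2",
    start_date=datetime(2017, 7, 1),
    schedule_interval="@daily",
)

run_main = DockerOperator(
    task_id="run_main",
    # Image with Python 3.6, pandas==0.20.0, numba==0.27, and the compiled
    # Cython extensions already baked in (name/tag illustrative)
    image="registry.example.com/team/dag2:latest",
    command="python -m dag2_project.entry",
    docker_url="tcp://worker-host:2375",  # remote machine running a Docker daemon
    force_pull=True,                      # always pick up the image CI last pushed
    environment={"ENV": "prod"},
    dag=dag,
)
```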

For continuous integration we use GitLab CI with the GitLab Container Registry for each repository. This should be easily doable with Jenkins as well.
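The GitLab side can be as simple as building and pushing an image on each commit, so the DockerOperator above always pulls dependencies that match the code. A minimal .gitlab-ci.yml sketch (stage names and the image tag are illustrative; the $CI_REGISTRY_* variables are predefined by GitLab):

```yaml
stages:
  - test
  - build

test:
  stage: test
  image: python:3.6
  script:
    - pip install -r requirements.txt
    - pytest

build:
  stage: build
  image: docker:latest
  services:
    - docker:dind          # Docker-in-Docker so the job can build images
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:latest" .
    - docker push "$CI_REGISTRY_IMAGE:latest"
```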

Answered Sep 22 '22 by Him