 

Airflow structure/organization of DAGs and tasks

My questions:

  • What is a good directory structure for organizing your DAGs and tasks? (The example DAGs show only a couple of tasks.)
  • I currently have my DAGs at the root of the dags folder and my tasks in separate directories; I am not sure if that is the right way to do it.
  • Should we use zip files? https://github.com/apache/incubator-airflow/blob/a1f4227bee1a70531cfa90769149322513cb6f92/airflow/models.py#L280
asked Jun 07 '17 by nono


People also ask

What is DAG and task in Airflow?

In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG is defined in a Python script, which represents the DAG's structure (tasks and their dependencies) as code.
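
For example, a minimal DAG with two dependent tasks might look like the sketch below (DAG and task names are illustrative, Airflow 2-style imports assumed):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical two-task DAG: the >> operator defines the dependency edge.
with DAG(
    dag_id="example_two_tasks",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> load  # load runs only after extract succeeds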

What are the precedence rules for the tasks in Airflow?

The precedence rules for a task are as follows:

  • Explicitly passed arguments.
  • Values that exist in the default_args dictionary.
  • The operator's default value, if one exists.
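
A minimal sketch of how this plays out, assuming hypothetical task names and values:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical values for illustration only.
default_args = {
    "owner": "data-team",
    "retries": 3,  # used by every task that does not set retries itself
}

with DAG(
    dag_id="precedence_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    default_args=default_args,
) as dag:
    # Explicitly passed argument wins: this task retries 5 times, not 3.
    t1 = BashOperator(task_id="explicit_wins", bash_command="echo hi", retries=5)

    # No explicit value: falls back to default_args, so retries=3.
    t2 = BashOperator(task_id="uses_default_args", bash_command="echo hi")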

How does Airflow look for DAGs?

Airflow looks in your DAGS_FOLDER for modules that contain DAG objects in their global namespace and adds the objects it finds to the DagBag. Knowing this, all we need is a way to dynamically assign variables in the global namespace.
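
A common way to do that is to generate DAGs in a loop and place each one in globals(); the sketch below assumes hypothetical client names and schedules:

from datetime import datetime
from airflow import DAG

# Hypothetical sketch: one DAG per client, assigned into the module's global
# namespace so the scheduler's DagBag picks each of them up.
for client in ["client_a", "client_b", "client_c"]:
    dag_id = f"process_{client}"
    globals()[dag_id] = DAG(
        dag_id=dag_id,
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
    )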

What happens if two DAGs have the same DAG ID?

When you instantiate a DAG object you have to specify a DAG ID, and that DAG ID must be unique across all of your DAGs. You should never have two DAGs with the same DAG ID, otherwise only one of them will show up and you might get unexpected behaviour.
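
One way to keep IDs unique in a per-project layout like the ones below is to prefix the DAG ID with the project name; this is just an assumed convention, not an Airflow requirement:

from datetime import datetime
from airflow import DAG

# Hypothetical convention: "<project>_<pipeline>" keeps the ID unique even if
# two project folders both contain a file named daily_load.py.
dag = DAG(
    dag_id="project_1_daily_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
)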


2 Answers

I use something like this.

  • A project is normally something completely separate or unique. Perhaps DAGs to process files that we receive from a certain client, which will be completely unrelated to everything else (almost certainly a separate database schema).
  • I have my operators, hooks, and some helper scripts (delete all Airflow data for a certain DAG, etc.) in a common folder.
  • I used to have a single git repository for the entire Airflow folder, but now I have a separate git repository per project (it makes things more organized and it is easier to grant permissions on GitLab since the projects are so unrelated). This means that each project folder also has a .git and .gitignore, etc.
  • I tend to save the raw data and then 'rest' a modified copy of the data, which is exactly what gets copied into the database. I have to heavily modify some of the raw data due to the different formats from different clients (Excel, web scraping, HTML email scraping, flat files, queries from Salesforce or other database sources...)

Example tree:

├───dags
│   ├───common
│   │   ├───hooks
│   │   │       pysftp_hook.py
│   │   │
│   │   ├───operators
│   │   │       docker_sftp.py
│   │   │       postgres_templated_operator.py
│   │   │
│   │   └───scripts
│   │           delete.py
│   │
│   ├───project_1
│   │   │   dag_1.py
│   │   │   dag_2.py
│   │   │
│   │   └───sql
│   │           dim.sql
│   │           fact.sql
│   │           select.sql
│   │           update.sql
│   │           view.sql
│   │
│   └───project_2
│       │   dag_1.py
│       │   dag_2.py
│       │
│       └───sql
│               dim.sql
│               fact.sql
│               select.sql
│               update.sql
│               view.sql
│
└───data
    ├───project_1
    │   ├───modified
    │   │       file_20180101.csv
    │   │       file_20180102.csv
    │   │
    │   └───raw
    │           file_20180101.csv
    │           file_20180102.csv
    │
    └───project_2
        ├───modified
        │       file_20180101.csv
        │       file_20180102.csv
        │
        └───raw
                file_20180101.csv
                file_20180102.csv
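
As a rough illustration of how a DAG file in this layout might use the shared code and the per-project sql folder (the operator class name, import path, and filesystem paths are assumptions, since only the file names appear in the tree):

# dags/project_1/dag_1.py -- hypothetical sketch only.
from datetime import datetime
from airflow import DAG

# Assumes the dags folder is on sys.path (Airflow adds DAGS_FOLDER to it) and
# that postgres_templated_operator.py defines a class with this name.
from common.operators.postgres_templated_operator import PostgresTemplatedOperator

with DAG(
    dag_id="project_1_dag_1",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    # Assumed absolute path so templated SQL like dim.sql and fact.sql resolves.
    template_searchpath="/usr/local/airflow/dags/project_1/sql",
) as dag:
    load_dim = PostgresTemplatedOperator(task_id="load_dim", sql="dim.sql")
    load_fact = PostgresTemplatedOperator(task_id="load_fact", sql="fact.sql")

    load_dim >> load_fact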

Update October 2021. I have a single repository for all projects now. All of my transformation scripts are in the plugins folder (which also contains hooks and operators - basically any code which I import into my DAGs). DAG code I try to keep pretty bare so it basically just dictates the schedules and where data is loaded to and from.

├───dags
│   ├───project_1
│   │       dag_1.py
│   │       dag_2.py
│   │
│   └───project_2
│           dag_1.py
│           dag_2.py
│
├───plugins
│   ├───hooks
│   │       pysftp_hook.py
│   │       servicenow_hook.py
│   │
│   ├───sensors
│   │       ftp_sensor.py
│   │       sql_sensor.py
│   │
│   ├───operators
│   │       servicenow_to_azure_blob_operator.py
│   │       postgres_templated_operator.py
│   │
│   └───scripts
│       ├───project_1
│       │       transform_cases.py
│       │       common.py
│       ├───project_2
│       │       transform_surveys.py
│       │       common.py
│       └───common
│               helper.py
│               dataset_writer.py
│
.airflowignore
Dockerfile
docker-stack-airflow.yml
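
Under that layout, a 'bare' DAG file might look roughly like this; the function name and module path inside plugins/scripts are assumptions based on the file names in the tree:

# dags/project_1/dag_1.py -- hypothetical sketch of the "bare DAG" style:
# just the schedule and the wiring, with the transformation code in plugins.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumes the plugins folder is on sys.path (Airflow adds it) and that
# transform_cases.py exposes a transform_cases() function.
from scripts.project_1.transform_cases import transform_cases

with DAG(
    dag_id="project_1_cases",
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="transform_cases", python_callable=transform_cases)
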
answered Oct 21 '22 by trench


I would love to benchmark folder structures with other people as well. It probably depends on what you are using Airflow for, but I will share my case. I am building data pipelines for a data warehouse, so at a high level I basically have two steps:

  1. Dump a lot of data into a data lake (directly accessible only to a few people).
  2. Load data from the data lake into an analytic database, where the data will be modeled and exposed to dashboard applications (many SQL queries to model the data).

Today I organize the files into three main folders that try to reflect the logic above:

├── dags
│   ├── dag_1.py
│   └── dag_2.py
├── data-lake
│   ├── data-source-1
│   └── data-source-2
└── dw
    ├── cubes
    │   ├── cube_1.sql
    │   └── cube_2.sql
    ├── dims
    │   ├── dim_1.sql
    │   └── dim_2.sql
    └── facts
        ├── fact_1.sql
        └── fact_2.sql

This is more or less my basic folder structure.
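
A hedged sketch of how one DAG might express those two steps against this layout; the connection ID, paths, callable, and the choice of Postgres for the analytic database are all assumptions:

# dags/dag_1.py -- hypothetical sketch: dump to the data lake, then run the
# modeling SQL from the dw/ folder against the analytic database.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator


def dump_source_1_to_lake(**context):
    # Placeholder: extract from the source system and write files under
    # data-lake/data-source-1/ (implementation omitted).
    ...


with DAG(
    dag_id="dag_1",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    template_searchpath="/opt/airflow/dw",  # assumed path to the dw/ folder
) as dag:
    dump = PythonOperator(task_id="dump_to_lake", python_callable=dump_source_1_to_lake)
    build_dim = PostgresOperator(task_id="build_dim_1", postgres_conn_id="dw", sql="dims/dim_1.sql")
    build_fact = PostgresOperator(task_id="build_fact_1", postgres_conn_id="dw", sql="facts/fact_1.sql")

    dump >> build_dim >> build_fact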

answered Oct 21 '22 by fernandosjp