In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG is defined in a Python script, which represents the DAG's structure (tasks and their dependencies) as code.
The precedence rules for a task are as follows:
1. Explicitly passed arguments.
2. Values that exist in the default_args dictionary.
3. The operator's default value, if one exists.
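As a minimal sketch of those rules (the DAG ID, task IDs and values below are made up purely for illustration), an explicitly passed argument on one task overrides the value from default_args, while the other task inherits it:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "airflow",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="precedence_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
) as dag:
    # No explicit retries: falls back to default_args (retries=2).
    t1 = BashOperator(task_id="t1", bash_command="echo t1")
    # Explicitly passed argument wins over default_args: retries=5.
    t2 = BashOperator(task_id="t2", bash_command="echo t2", retries=5)

    t1 >> t2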
Airflow looks in your DAGS_FOLDER for modules that contain DAG objects in their global namespace and adds the objects it finds to the DagBag. Knowing this, all we need is a way to dynamically assign variables in the global namespace.
The unique identifier, or DAG ID: when you instantiate a DAG object you have to specify a DAG ID. The DAG ID must be unique across all of your DAGs. You should never have two DAGs with the same DAG ID, otherwise only one DAG will show up and you might get unexpected behaviour.
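A minimal sketch of that idea; the create_dag factory and the project names are made up for illustration. Each generated DAG gets a unique DAG ID and is assigned into globals(), which is where the DagBag will find it:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator


def create_dag(dag_id, schedule):
    # Illustrative factory; the real tasks would go inside this context.
    with DAG(dag_id=dag_id, start_date=datetime(2021, 1, 1), schedule_interval=schedule) as dag:
        BashOperator(task_id="run", bash_command=f"echo {dag_id}")
    return dag


# Each DAG gets a unique DAG ID and is assigned into the module's global
# namespace, which is where Airflow's DagBag looks for DAG objects.
for project in ["project_1", "project_2"]:
    dag_id = f"load_{project}"
    globals()[dag_id] = create_dag(dag_id, "@daily")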
I use something like this.
Example tree:
├───dags
│   ├───common
│   │   ├───hooks
│   │   │       pysftp_hook.py
│   │   │
│   │   ├───operators
│   │   │       docker_sftp.py
│   │   │       postgres_templated_operator.py
│   │   │
│   │   └───scripts
│   │           delete.py
│   │
│   ├───project_1
│   │   │   dag_1.py
│   │   │   dag_2.py
│   │   │
│   │   └───sql
│   │           dim.sql
│   │           fact.sql
│   │           select.sql
│   │           update.sql
│   │           view.sql
│   │
│   └───project_2
│       │   dag_1.py
│       │   dag_2.py
│       │
│       └───sql
│               dim.sql
│               fact.sql
│               select.sql
│               update.sql
│               view.sql
│
└───data
    ├───project_1
    │   ├───modified
    │   │       file_20180101.csv
    │   │       file_20180102.csv
    │   │
    │   └───raw
    │           file_20180101.csv
    │           file_20180102.csv
    │
    └───project_2
        ├───modified
        │       file_20180101.csv
        │       file_20180102.csv
        │
        └───raw
                file_20180101.csv
                file_20180102.csv
Update, October 2021: I have a single repository for all projects now. All of my transformation scripts are in the plugins folder (which also contains hooks and operators - basically any code which I import into my DAGs). I try to keep the DAG code pretty bare, so it basically just dictates the schedules and where data is loaded to and from; a rough example of such a bare DAG follows the tree below.
├───dags
│   ├───project_1
│   │       dag_1.py
│   │       dag_2.py
│   │
│   └───project_2
│           dag_1.py
│           dag_2.py
│
├───plugins
│   ├───hooks
│   │       pysftp_hook.py
│   │       servicenow_hook.py
│   │
│   ├───sensors
│   │       ftp_sensor.py
│   │       sql_sensor.py
│   │
│   ├───operators
│   │       servicenow_to_azure_blob_operator.py
│   │       postgres_templated_operator.py
│   │
│   └───scripts
│       ├───project_1
│       │       transform_cases.py
│       │       common.py
│       ├───project_2
│       │       transform_surveys.py
│       │       common.py
│       └───common
│               helper.py
│               dataset_writer.py
│
│   .airflowignore
│   Dockerfile
│   docker-stack-airflow.yml
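For illustration, a bare DAG in dags/project_1/dag_1.py might look roughly like this. The scripts.project_1.transform_cases module and its transform() entry point are assumptions based on the tree above (Airflow puts the plugins folder on sys.path, so modules under it can be imported directly):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumed module under plugins/scripts/project_1 (see tree above).
from scripts.project_1 import transform_cases

with DAG(
    dag_id="project_1_dag_1",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The DAG only wires up the schedule; the actual work lives in plugins.
    PythonOperator(
        task_id="transform_cases",
        python_callable=transform_cases.transform,  # assumed entry point
    )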
I would love to benchmark folder structures with other people as well. It probably depends on what you are using Airflow for, but I will share my case. I build data pipelines for a data warehouse, so at a high level I basically have two steps:
1. Extract data from the sources and load it into a data lake.
2. Transform the data from the data lake into the data warehouse (dims, facts and cubes).
Today I organize the files into three main folders that try to reflect the logic above:
├── dags
│   ├── dag_1.py
│   └── dag_2.py
├── data-lake
│   ├── data-source-1
│   └── data-source-2
└── dw
    ├── cubes
    │   ├── cube_1.sql
    │   └── cube_2.sql
    ├── dims
    │   ├── dim_1.sql
    │   └── dim_2.sql
    └── facts
        ├── fact_1.sql
        └── fact_2.sql
This is more or less my basic folder structure.
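For illustration, a minimal sketch of how the dw folder could plug into a DAG; the connection id, the search path and the dependency order here are assumptions, not a definitive setup. template_searchpath lets the .sql files be referenced by their relative paths:

from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="build_dw",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    # Assumed location of the dw folder inside the Airflow deployment.
    template_searchpath=["/opt/airflow/dw"],
) as dag:
    dim_1 = PostgresOperator(
        task_id="dim_1",
        postgres_conn_id="dw",  # assumed connection id
        sql="dims/dim_1.sql",
    )
    fact_1 = PostgresOperator(
        task_id="fact_1",
        postgres_conn_id="dw",
        sql="facts/fact_1.sql",
    )
    # Assumed dependency: facts are built after dimensions.
    dim_1 >> fact_1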