I'm currently developing DAGs for Airflow. I like to use PyCharm and tend to spin up a virtual environment for each of my projects.
Airflow depends on an AIRFLOW_HOME folder that is set during installation; Airflow then creates its subdirectories inside this folder.
I'm interested in how others structure their projects to allow for virtual environments containing the packages needed to acquire data (such as facebookads), while also making it easy to drop the DAGs into Airflow's dags folder for testing.
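For context, here is a minimal sketch (assuming Airflow is installed in the active virtual environment) that prints where Airflow resolves AIRFLOW_HOME and its dags folder, which is where project DAGs need to land for testing:

```python
# Minimal sketch, assuming Airflow is installed in the active virtualenv.
# Prints the resolved AIRFLOW_HOME and the dags_folder the scheduler scans,
# so project DAG files can be copied or symlinked there for testing.
import os

from airflow.configuration import conf  # reads $AIRFLOW_HOME/airflow.cfg

print(os.environ.get("AIRFLOW_HOME", os.path.expanduser("~/airflow")))
print(conf.get("core", "dags_folder"))  # defaults to $AIRFLOW_HOME/dags
```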
A different approach is to increase the number of threads available on the machine that runs the scheduler process, so that the max_threads parameter can be set to a higher value. With a higher value, the Airflow scheduler can process a larger number of DAGs more effectively.
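A hedged sketch of how that setting is usually overridden: the option lives under [scheduler] in airflow.cfg and can also be set through an environment variable; note that newer Airflow releases renamed max_threads to parsing_processes.

```python
# Sketch only: Airflow reads config overrides from environment variables of the
# form AIRFLOW__<SECTION>__<KEY>. These must be set in the environment that
# launches `airflow scheduler`, not inside a DAG.
import os

os.environ["AIRFLOW__SCHEDULER__MAX_THREADS"] = "4"          # Airflow 1.10.x name
# os.environ["AIRFLOW__SCHEDULER__PARSING_PROCESSES"] = "4"  # renamed in Airflow 2.x
```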
Ideally, a task should flow from none, to scheduled, to queued, to running, and finally to success.
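Those states correspond to constants Airflow exposes; a small sketch listing them in lifecycle order:

```python
# Sketch: the lifecycle states mentioned above, as Airflow names them
# (airflow.utils.state.State; State.NONE is literally None).
from airflow.utils.state import State

lifecycle = [State.NONE, State.SCHEDULED, State.QUEUED, State.RUNNING, State.SUCCESS]
```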
Apache Airflow is used for scheduling and orchestrating data pipelines or workflows. Orchestration of data pipelines refers to sequencing, coordinating, scheduling, and managing complex data pipelines from diverse sources.
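For illustration, a minimal DAG (Airflow 2-style imports, hypothetical task names) showing the kind of sequencing and scheduling that orchestration refers to:

```python
# Minimal illustrative DAG: two tasks scheduled daily, with an explicit ordering.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_orchestration",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> load  # sequencing: run extract before load
```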
In our current case, we follow a simple structure (a sketch of how a DAG can use this layout follows the tree):
- dags
  - dag001.py
  - dag002.py
- helpers
  - dag_001_helpers
    - file01.py
    - file02.py
  - dag_002_helpers
    - file01.py
    - file02.py
- configs
  - dag_001_configs
    - file11.json
    - file12.sql
  - dag_002_configs
    - file21.json
    - file22.py
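A hedged sketch of how a DAG file in dags/ might use this layout, assuming the project root is on PYTHONPATH (or is itself the configured dags_folder) so that helpers is importable; module and key names are illustrative:

```python
# Sketch only: load a per-DAG JSON config and reuse helper code from the
# sibling folders shown above. Paths follow the tree, but the helper module
# contents are hypothetical.
import json
from pathlib import Path

from helpers.dag_001_helpers import file01  # reusable functions for dag_001

PROJECT_ROOT = Path(__file__).resolve().parent.parent
CONFIG_DIR = PROJECT_ROOT / "configs" / "dag_001_configs"

with open(CONFIG_DIR / "file11.json") as f:
    dag_settings = json.load(f)
```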
In my projects I use:
- config
  - config_1.yaml
  - config_1.env
- DAGs
  - dag_1.py
  - dag_1_etl_1.sql
  - dag_1_etl_2.sql
  - dag_1_etl_3.sql
  - dag_1_bash_1.sh
  - dag_2.py
  - dag_3.py
- operators
  - operator_1.py
  - operator_2.py
  - operator_3.py
- hooks
  - hooks_1.py
For our use case: 1) every object that can be reused is stored in a folder together with objects of the same kind;
2) every DAG's SQL must be self-contained, to avoid non-mapped dependencies (see the sketch below).
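A hedged sketch of rule 2, assuming the Postgres provider is installed and a connection named "warehouse" exists (both hypothetical): the DAG reads its SQL from the files that sit next to dag_1.py, so nothing outside the DAGs folder is required.

```python
# Sketch only: dag_1.py keeps its SQL next to itself (dag_1_etl_1.sql, ...),
# so the DAG is self-contained. Connection id and schedule are hypothetical.
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

SQL_DIR = Path(__file__).resolve().parent  # the DAGs folder

with DAG(
    dag_id="dag_1",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    etl_1 = PostgresOperator(
        task_id="etl_1",
        postgres_conn_id="warehouse",  # hypothetical connection id
        sql=(SQL_DIR / "dag_1_etl_1.sql").read_text(),
    )
```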