I'm currently developing DAGs for Airflow. I like to use PyCharm and tend to spin up a virtual environment for each of my projects.
Airflow depends on an AIRFLOW_HOME folder that is set during installation; Airflow then creates its subdirectories inside this folder.
I'm interested in how others structure their projects to allow for virtual environments containing the packages needed to acquire data (such as facebookads), while also making it easy to drop the DAGs into Airflow's dags folder for testing.
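For context, here is a minimal sketch (assuming Airflow is installed in the active virtual environment) that prints where Airflow resolves AIRFLOW_HOME and its dags folder, which is where project DAGs need to land for testing:

```python
# Minimal sketch, assuming Airflow is installed in the active virtualenv.
# Prints the resolved AIRFLOW_HOME and the dags_folder the scheduler scans,
# so project DAG files can be copied or symlinked there for testing.
import os

from airflow.configuration import conf  # reads $AIRFLOW_HOME/airflow.cfg

print(os.environ.get("AIRFLOW_HOME", os.path.expanduser("~/airflow")))
print(conf.get("core", "dags_folder"))  # defaults to $AIRFLOW_HOME/dags
```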
A different approach is to increase the number of threads available on the machine that runs the scheduler process, so that the max_threads parameter can be set to a higher value. With a higher value, the Airflow scheduler can process a larger number of DAGs more effectively.
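A hedged sketch of how that setting is usually overridden: the option lives under [scheduler] in airflow.cfg and can also be set through an environment variable; note that newer Airflow releases renamed max_threads to parsing_processes.

```python
# Sketch only: Airflow reads config overrides from environment variables of the
# form AIRFLOW__<SECTION>__<KEY>. These must be set in the environment that
# launches `airflow scheduler`, not inside a DAG.
import os

os.environ["AIRFLOW__SCHEDULER__MAX_THREADS"] = "4"          # Airflow 1.10.x name
# os.environ["AIRFLOW__SCHEDULER__PARSING_PROCESSES"] = "4"  # renamed in Airflow 2.x
```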
Ideally, a task should flow from none, to scheduled, to queued, to running, and finally to success.
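Those states correspond to constants Airflow exposes; a small sketch listing them in lifecycle order:

```python
# Sketch: the lifecycle states mentioned above, as Airflow names them
# (airflow.utils.state.State; State.NONE is literally None).
from airflow.utils.state import State

lifecycle = [State.NONE, State.SCHEDULED, State.QUEUED, State.RUNNING, State.SUCCESS]
```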
Apache Airflow is used for scheduling and orchestrating data pipelines or workflows. Orchestration of data pipelines refers to sequencing, coordinating, scheduling, and managing complex data pipelines from diverse sources.
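For illustration, a minimal DAG (Airflow 2-style imports, hypothetical task names) showing the kind of sequencing and scheduling that orchestration refers to:

```python
# Minimal illustrative DAG: two tasks scheduled daily, with an explicit ordering.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_orchestration",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> load  # sequencing: run extract before load
```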
In our current case, we follow a simple structure (a sketch of how a DAG can use this layout follows the tree):
- dags
  - dag001.py
  - dag002.py
- helpers
  - dag_001_helpers
    - file01.py
    - file02.py
  - dag_002_helpers
    - file01.py
    - file02.py
- configs
  - dag_001_configs
    - file11.json
    - file12.sql
  - dag_002_configs
    - file21.json
    - file22.py
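A hedged sketch of how a DAG file in dags/ might use this layout, assuming the project root is on PYTHONPATH (or is itself the configured dags_folder) so that helpers is importable; module and key names are illustrative:

```python
# Sketch only: load a per-DAG JSON config and reuse helper code from the
# sibling folders shown above. Paths follow the tree, but the helper module
# contents are hypothetical.
import json
from pathlib import Path

from helpers.dag_001_helpers import file01  # reusable functions for dag_001

PROJECT_ROOT = Path(__file__).resolve().parent.parent
CONFIG_DIR = PROJECT_ROOT / "configs" / "dag_001_configs"

with open(CONFIG_DIR / "file11.json") as f:
    dag_settings = json.load(f)
```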
In my projects I use:
- config
  - config_1.yaml
  - config_1.env
- DAGs
  - dag_1.py
  - dag_1_etl_1.sql
  - dag_1_etl_2.sql
  - dag_1_etl_3.sql
  - dag_1_bash_1.sh
  - dag_2.py
  - dag_3.py
- operators
  - operator_1.py
  - operator_2.py
  - operator_3.py
- hooks
  - hooks_1.py
For our use case: 1) every object that can be reused is stored in a folder together with objects of the same kind;
2) every DAG's SQL must be self-contained, to avoid non-mapped dependencies (see the sketch below).
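A hedged sketch of rule 2, assuming the Postgres provider is installed and a connection named "warehouse" exists (both hypothetical): the DAG reads its SQL from the files that sit next to dag_1.py, so nothing outside the DAGs folder is required.

```python
# Sketch only: dag_1.py keeps its SQL next to itself (dag_1_etl_1.sql, ...),
# so the DAG is self-contained. Connection id and schedule are hypothetical.
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

SQL_DIR = Path(__file__).resolve().parent  # the DAGs folder

with DAG(
    dag_id="dag_1",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    etl_1 = PostgresOperator(
        task_id="etl_1",
        postgres_conn_id="warehouse",  # hypothetical connection id
        sql=(SQL_DIR / "dag_1_etl_1.sql").read_text(),
    )
```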