 

How to dynamically create subdags in Airflow

Tags:

airflow

I have a main DAG which retrieves a file and splits the data in it into separate CSV files. There is another set of tasks that must be done for each of these CSV files (e.g. uploading to GCS, inserting into BigQuery). How can I generate a SubDAG for each file dynamically, based on the number of files? (The SubDAG will define tasks like uploading to GCS, inserting into BigQuery, and deleting the CSV file.)

So right now, this is what it looks like:

main_dag = DAG(....)
download_operator = SFTPOperator(dag = main_dag, ...)  # downloads file
transform_operator = PythonOperator(dag = main_dag, ...) # Splits data and writes csv files

def subdag_factory(): # Will return a subdag with tasks for uploading to GCS, inserting to BigQuery.
    ...
    ...

How can I call the subdag_factory for each file generated in transform_operator?
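For reference, the kind of wiring being asked about might look like the sketch below. It assumes (and this is the crux of the question) that the list of CSV file names is known when the DAG file is parsed; expected_files and the subdag_factory signature are illustrative, not part of the code above.

# Sketch only: assumes the csv file names are known at parse time and that
# subdag_factory(parent_dag_id, child_dag_id, file_name) returns a DAG;
# both assumptions go beyond the code above.
from airflow.operators.subdag_operator import SubDagOperator

expected_files = ['part1.csv', 'part2.csv']  # would need to be known up front

for file_name in expected_files:
    child_id = 'process_' + file_name.replace('.', '_')
    subdag_op = SubDagOperator(
        task_id=child_id,
        subdag=subdag_factory(main_dag.dag_id, child_id, file_name),
        dag=main_dag,
    )
    transform_operator >> subdag_op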

asked Feb 23 '18 by AshanPerera

People also ask

How do I create a dynamic task in Airflow?

Airflow's dynamic task mapping feature is built off of the MapReduce programming model. The map procedure takes a set of inputs and creates a single task for each one. The reduce procedure, which is optional, allows a task to operate on the collected output of a mapped task.
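A minimal sketch of dynamic task mapping (Airflow 2.3+); the dag_id, task names, and file list here are illustrative:

from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False)
def mapped_example():
    @task
    def list_files():
        # in practice this would list the generated csv files
        return ["a.csv", "b.csv", "c.csv"]

    @task
    def upload(file_name):
        print("uploading", file_name)

    # one mapped task instance is created per file at run time
    upload.expand(file_name=list_files())

mapped_example()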

How do I create a dynamic DAG in Airflow?

Dynamic DAGs with globals(): you can dynamically generate DAGs by working with globals(). As long as a DAG object is created in globals(), Airflow will load it.

Can we create DAG from Airflow UI?

Because everything in Airflow is code, you can dynamically generate DAGs using Python alone. As long as a DAG object in globals() is created by Python code that lives in the dags_folder, Airflow will load it. In this guide, we'll cover a few of the many ways you can generate DAGs.
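A minimal sketch of the globals() pattern; the database names and dag_id prefix are illustrative:

from datetime import datetime

from airflow import DAG

for db_name in ["db1", "db2", "db3"]:  # illustrative; could come from a config file
    dag_id = "process_{}".format(db_name)
    # assigning the DAG object into globals() is what makes Airflow pick it up
    globals()[dag_id] = DAG(
        dag_id=dag_id,
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
    )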


2 Answers

I tried creating SubDAGs dynamically as follows:

# assumed imports (Airflow 1.x-style paths, matching the era of this answer)
import airflow.utils.helpers
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.subdag_operator import SubDagOperator

# create and return a DAG
def create_subdag(dag_parent, dag_id_child_prefix, db_name):
    # dag params
    dag_id_child = '%s.%s' % (dag_parent.dag_id, dag_id_child_prefix + db_name)
    default_args_copy = default_args.copy()

    # dag
    dag = DAG(dag_id=dag_id_child,
              default_args=default_args_copy,
              schedule_interval='@once')

    # operators
    tid_check = 'check2_db_' + db_name
    py_op_check = PythonOperator(task_id=tid_check, dag=dag,
                                 python_callable=check_sync_enabled,
                                 op_args=[db_name])

    tid_spark = 'spark2_submit_' + db_name
    py_op_spark = PythonOperator(task_id=tid_spark, dag=dag,
                                 python_callable=spark_submit,
                                 op_args=[db_name])

    py_op_check >> py_op_spark
    return dag

# wrap DAG into SubDagOperator
def create_subdag_operator(dag_parent, db_name):
    tid_subdag = 'subdag_' + db_name
    subdag = create_subdag(dag_parent, tid_prefix_subdag, db_name)
    sd_op = SubDagOperator(task_id=tid_subdag, dag=dag_parent, subdag=subdag)
    return sd_op

# create SubDagOperator for each db in db_names
def create_all_subdag_operators(dag_parent, db_names):
    subdags = [create_subdag_operator(dag_parent, db_name) for db_name in db_names]
    # chain subdag-operators together
    airflow.utils.helpers.chain(*subdags)
    return subdags


# (top-level) DAG & operators
dag = DAG(dag_id=dag_id_parent,
          default_args=default_args,
          schedule_interval=None)

subdag_ops = create_all_subdag_operators(dag, db_names)

Note that the list of inputs for which SubDAGs are created, here db_names, can either be declared statically in the Python file or be read from an external source.
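For instance, a sketch of reading the list from an external source (here a hypothetical db_names.json sitting next to the DAG file; the file name is an assumption, not part of the original answer):

import json
import os

# assumption: db_names.json contains something like ["db1", "db2", "db3"]
config_path = os.path.join(os.path.dirname(__file__), 'db_names.json')
with open(config_path) as f:
    db_names = json.load(f)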

The resulting DAG looks like this: [screenshot of the parent DAG's graph view]

Diving into the SubDAG(s): [screenshots of the SubDAG graph views]

answered Nov 15 '22 by y2k-shubham


Airflow deals with DAGs in two different ways.

  1. One way is to define your dynamic DAG generator in one Python file and put it into the dags_folder. It then generates DAGs based on an external source (config files in another directory, SQL, NoSQL, etc.). The fewer changes to the structure of the DAGs, the better (really true for all situations). For instance, our DAG file generates a DAG for every record (or file), and it generates the dag_id as well. On every Airflow scheduler heartbeat this code goes through the list and generates the corresponding DAGs. Pros: not many, just one code file to change. Cons: plenty, and they come down to the way Airflow works. For every new DAG (dag_id), Airflow writes its steps into the database, so when the number of steps or a step's name changes it might break the web server. When you delete a DAG from your list it becomes a kind of orphan: you can't access it from the web interface, you have no control over it, you can't see its steps, you can't restart them, and so on. If you have a static list of DAGs whose IDs are not going to change but whose steps occasionally do, this method is acceptable.

  2. So at some point I came up with another solution. You have static DAGs (they are still dynamic in the sense that a script generates them, but their structure and IDs do not change). So instead of one script that walks through the list as in the first approach and generates DAGs, you build two static DAGs: one monitors the directory periodically (*/10 * * * *), and the other one is triggered by the first. So when a new file (or files) appears, the first DAG triggers the second one with a conf argument. The following code has to be executed for every file in the directory (a sketch of looping over the files follows the snippet).

         # imports needed by this snippet
         import logging
         from datetime import datetime
         from airflow import settings
         from airflow.models import DagRun

         session = settings.Session()
         dr = DagRun(
                     dag_id=dag_to_be_triggered,
                     run_id=uuid_run_id,
                     conf={'file_path': path_to_the_file},
                     execution_date=datetime.now(),
                     start_date=datetime.now(),
                     external_trigger=True)
         logging.info("Creating DagRun {}".format(dr))
         session.add(dr)
         session.commit()
         session.close()
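For example, the monitoring DAG's Python callable might wrap the snippet above in a loop over the directory. This is a sketch: watch_dir, dag_to_be_triggered's default, and the run_id scheme are illustrative assumptions, not part of the original answer.

    # Run the DagRun-creation snippet above once per file in a watched directory.
    import os
    import uuid
    from datetime import datetime

    from airflow import settings
    from airflow.models import DagRun

    def scan_and_trigger(watch_dir, dag_to_be_triggered):
        session = settings.Session()
        for file_name in os.listdir(watch_dir):
            path_to_the_file = os.path.join(watch_dir, file_name)
            dr = DagRun(
                dag_id=dag_to_be_triggered,
                run_id='trig__' + str(uuid.uuid4()),  # assumed run-id scheme
                conf={'file_path': path_to_the_file},
                execution_date=datetime.now(),
                start_date=datetime.now(),
                external_trigger=True)
            session.add(dr)
        session.commit()
        session.close()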
     

The triggered DAG can receive the conf arg and finish all the required tasks for the particular file. To access the conf param, use this (a sketch of wiring this callable into the triggered DAG follows the snippet):

    def work_with_the_file(**context):
        path_to_file = context['dag_run'].conf['file_path'] \
            if 'file_path' in context['dag_run'].conf else None

        if not path_to_file:
            raise Exception('path_to_file must be provided')
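To attach this callable to the triggered DAG, something like the following should work (a sketch: triggered_dag is assumed to be that DAG's object, and the import path and provide_context flag are Airflow 1.x style, matching the era of this answer):

    from airflow.operators.python_operator import PythonOperator

    process_file = PythonOperator(
        task_id='work_with_the_file',
        python_callable=work_with_the_file,
        provide_context=True,   # needed on Airflow 1.x; context is passed automatically in 2.x
        dag=triggered_dag,      # assumption: the DAG object of the triggered DAG
    )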

Pros: all the flexibility and functionality of Airflow.

Cons: the monitoring DAG can be spammy.

answered Nov 15 '22 by Andrey Kartashov