
Wiring top-level DAGs together

Tags:

airflow

I need to have several identical (differing only in arguments) top-level DAGs that can also be triggered together, with the following constraints / assumptions:

- Individual top-level DAGs will have schedule_interval=None, as they will only need occasional manual triggering
- The series of DAGs, however, needs to run daily
- The order and number of DAGs in the series is fixed (known ahead of writing code) and changes rarely (once in a few months)
- Irrespective of whether a DAG fails or succeeds, the chain of triggering must not break
- Currently they must be run together in series; in future they may require parallel triggering

So I created one file for each DAG in my dags directory and now I must wire them up for sequential execution. I have identified two ways this could be done:

1. SubDagOperator
   - Works without a glitch in my demo
   - Can lead to deadlocks, but there are easy solutions; still, there's a lot of haze around using them
   - A SubDag's dag_id must be prefixed by its parent's; that would force absurd IDs on top-level DAGs that are supposed to function independently too
2. TriggerDagRunOperator
   - Works in my demo but runs in parallel (not sequentially), as it doesn't wait for the triggered DAG to finish before moving on to the next one
   - ExternalTaskSensor might help overcome the above limitation, but it would make things very messy

My questions are:

- How to overcome the limitation of the parent_id prefix in the dag_id of SubDags?
- How to force TriggerDagRunOperators to await completion of the triggered DAG?
- Is there any alternate / better way to wire up independent (top-level) DAGs together?
- Is there a workaround for my approach of creating separate files (for DAGs that differ only in input) for each top-level DAG?

I'm using puckel/docker-airflow with:

- Airflow 1.9.0-4
- Python 3.6-slim
- CeleryExecutor with redis:3.2.7

EDIT-1

Clarifying @Viraj Parekh's queries

Can you give some more detail on what you mean by awaiting completion of the DAG before getting triggered?

When I trigger the import_parent_v1 DAG, all 3 external DAGs that it is supposed to fire using TriggerDagRunOperator start running in parallel, even though I chain them sequentially. The logs indicate that while they are fired one after another, execution moves on to the next DAG (TriggerDagRunOperator) before the previous one has finished.

NOTE: In this example, the top-level DAGs are named importer_child_v1_db_X and their corresponding task_ids (for TriggerDagRunOperator) are named importer_v1_db_X

Would it be possible to just have the TriggerDagRunOperator be the last task in a DAG?

I have to chain several similar (differing only in arguments) DAGs together in a workflow that triggers them one by one. So there isn't just one TriggerDagRunOperator that I could put last; there are many (here 3, but up to 15 in production).


asked Jul 13 '18 12:07

y2k-shubham

1 Answer

Taking hints from @Viraj Parekh's answer, I was able to make TriggerDagRunOperator work in the intended fashion. I'm hereby posting my (partial) answer; will update as and when things become clear.

How to overcome limitation of parent_id prefix in dag_id of SubDags?

As @Viraj said, there's no straightforward way of achieving this. Extending SubDagOperator to remove this check might work, but I decided to steer clear of it.

How to force TriggerDagRunOperators to await completion of DAG?

- Looking at the implementation, it becomes clear that the job of TriggerDagRunOperator is just to trigger the external DAG, and that's about it. By default, it is not supposed to wait for completion of that DAG. Therefore the behaviour I'm observing is understandable.
- ExternalTaskSensor is the obvious way out. However, while learning the basics of Airflow I was relying on manual triggering of DAGs (schedule_interval=None). In that case, ExternalTaskSensor makes it difficult to accurately specify the execution_date of the external task (whose completion is being awaited), failing which the sensor gets stuck.
- So, taking a hint from the implementation, I made a minor adjustment to the behaviour of ExternalTaskSensor: it awaits completion of all task_instances of the concerned task having

execution_date[external_task] >= execution_date[TriggerDagRunOperator] + execution_delta

This achieves the desired result: the external DAGs run one after another in sequence.
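The condition above can be sketched in plain Python, outside Airflow, to show what the adjusted sensor checks on each poke. This is only an illustration: TaskInstanceRow, external_task_done and allowed_states are hypothetical names, not Airflow APIs, and returning False while no matching instance exists yet is an assumption (so the check does not pass before the triggered run has been created).

```python
from collections import namedtuple
from datetime import datetime, timedelta

# Hypothetical record standing in for one row of Airflow's task_instance table.
TaskInstanceRow = namedtuple("TaskInstanceRow", ["task_id", "execution_date", "state"])


def external_task_done(rows, task_id, trigger_execution_date,
                       execution_delta=timedelta(0),
                       allowed_states=("success",)):
    """Return True once every matching task instance is in an allowed state."""
    # execution_date[external_task] >= execution_date[trigger] + execution_delta
    threshold = trigger_execution_date + execution_delta
    matching = [r for r in rows
                if r.task_id == task_id and r.execution_date >= threshold]
    # Assumption: with no matching instance yet, keep waiting.
    if not matching:
        return False
    return all(r.state in allowed_states for r in matching)


parent_run = datetime(2018, 7, 13, 12, 0)
rows = [
    # Older run of the external task: below the threshold, so it is ignored.
    TaskInstanceRow("final_task", datetime(2018, 7, 12, 9, 0), "success"),
    # Run created by the trigger: still in progress.
    TaskInstanceRow("final_task", datetime(2018, 7, 13, 12, 0, 5), "running"),
]
print(external_task_done(rows, "final_task", parent_run))  # False
```

Because the chain must not break when a child DAG fails, passing something like allowed_states=("success", "failed") is probably closer to the requirement in the question than waiting for success alone.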

Is there a workaround for my approach of creating separate files (for DAGs that differ only in input) for each top-level DAG?

Again going by @Viraj, this can be done by assigning the DAGs to the module's global scope using globals()[dag_id] = DAG(..)
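A minimal sketch of that pattern follows. To keep it runnable without Airflow installed, DagStub is a hypothetical stand-in for airflow.DAG; in a real file inside the dags directory you would build actual DAG objects (with a start_date and tasks) instead. DB_NAMES and make_dag are also illustrative names. The point is only that one file can register many DAGs, because Airflow picks up DAG objects it finds among the module's globals.

```python
from collections import namedtuple

# Hypothetical stand-in for airflow.DAG, used only so this sketch runs anywhere.
DagStub = namedtuple("DagStub", ["dag_id", "schedule_interval", "params"])

# The arguments in which the otherwise identical DAGs differ (illustrative values).
DB_NAMES = ["a", "b", "c"]


def make_dag(db_name):
    """Build one top-level DAG for the given argument."""
    dag_id = "importer_child_v1_db_{}".format(db_name)
    return DagStub(dag_id=dag_id, schedule_interval=None, params={"db": db_name})


generated_ids = []
for db_name in DB_NAMES:
    dag = make_dag(db_name)
    # Bind each DAG to a module-level name so it is discoverable from this one file.
    globals()[dag.dag_id] = dag
    generated_ids.append(dag.dag_id)

print(generated_ids)
```

Since the order and number of DAGs change rarely, keeping the list of arguments in one place like this also makes it the single spot to edit when the series changes.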

EDIT-1

Maybe I was referring to an incorrect resource (the link above is already dead), but ExternalTaskSensor already includes the params execution_delta & execution_date_fn to easily restrict the execution_date(s) of the task being sensed.
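For reference, here is a plain-Python sketch of how I understand the stock sensor to pick the execution_date it looks for. sensed_execution_date is a hypothetical helper, not part of Airflow. Note that the stock sensor subtracts execution_delta from its own execution_date, unlike the custom condition earlier in this answer, which adds it.

```python
from datetime import datetime, timedelta


def sensed_execution_date(execution_date, execution_delta=None, execution_date_fn=None):
    """Return the execution_date of the external task that would be sensed."""
    if execution_delta is not None and execution_date_fn is not None:
        raise ValueError("pass execution_delta or execution_date_fn, not both")
    if execution_delta is not None:
        # A positive delta looks back in time, e.g. timedelta(days=1) means yesterday.
        return execution_date - execution_delta
    if execution_date_fn is not None:
        # Arbitrary mapping from this run's date to the external run's date.
        return execution_date_fn(execution_date)
    # Default: the external task is expected to share the same execution_date.
    return execution_date


run = datetime(2018, 7, 13, 12, 0)
print(sensed_execution_date(run, execution_delta=timedelta(days=1)))
print(sensed_execution_date(run, execution_date_fn=lambda d: d.replace(hour=0, minute=0)))
```

Either way the sensor waits for one exact execution_date, which is why manually triggered runs (whose execution_date is the trigger time) are hard to match.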


answered Oct 13 '22 18:10

y2k-shubham
