As the title implies, I'm looking to understand the difference between setting catchup=False in the DAG definition and using the LatestOnlyOperator.
https://airflow.apache.org/docs/stable/scheduler.html https://airflow.apache.org/docs/stable/_modules/airflow/operators/latest_only_operator.html
Well, I would say they are totally different concepts, and they can be used independently. It is true that both can be used to prevent backfilling, but if that's your only concern, just use catchup=False. In fact, quoting from this reply by one of the Airflow developers, it seems clear that this is the recommended practice:
As the author of LatestOnlyOperator, the goal was as a stopgap until catchup=False landed.
But he then goes on to say that LatestOnlyOperator should be deprecated. I don't agree (as a user of both catchup=False and LatestOnlyOperator), and I'll try to explain. My intuition of these two concepts is this:
Catchup = True
In a DAG definition you can set the catchup flag to True. Note that catchup is an argument of the DAG() constructor itself, not an entry in default_args (which only holds defaults passed to the tasks). If you set this flag to True and switch the DAG on, the scheduler will create a DAG run for each completed schedule interval from the start_date up to the present and will execute them sequentially. Quoting the documentation:
If the dag.catchup value had been True instead, the scheduler would have created a DAG Run for each completed interval between 2015-12-01 and 2016-01-02 (but not yet one for 2016-01-02, as that interval hasn’t completed) and the scheduler will execute them sequentially.
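To make the quoted behaviour concrete, here is a minimal pure-Python sketch (no Airflow import) of which runs would be created with and without catchup. The function name scheduled_runs is hypothetical, and the daily schedule and the "now" of 2016-01-02 06:00 are assumptions chosen to match the dates in the quote. It only illustrates the idea and is not how the scheduler is actually implemented.

```python
from datetime import datetime, timedelta


def scheduled_runs(start_date, interval, now, catchup):
    """Return the interval start times for which a DAG run would be created.

    Simplified model: a run is created only once its interval
    [start, start + interval) has fully elapsed.
    """
    runs = []
    current = start_date
    while current + interval <= now:
        runs.append(current)
        current += interval
    if not catchup:
        # Without catchup, only the most recent completed interval is kept.
        runs = runs[-1:]
    return runs


start = datetime(2015, 12, 1)          # start_date from the quoted example
now = datetime(2016, 1, 2, 6, 0)       # assumed moment the DAG is switched on
daily = timedelta(days=1)              # assumed daily schedule

with_catchup = scheduled_runs(start, daily, now, catchup=True)
without_catchup = scheduled_runs(start, daily, now, catchup=False)

print(len(with_catchup), with_catchup[0], with_catchup[-1])
print(without_catchup)
```

With catchup the list runs from 2015-12-01 through 2016-01-01 (the interval starting 2016-01-02 has not completed yet); without it only the 2016-01-01 interval remains.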
LatestOnlyOperator
A LatestOnlyOperator is an extension (a subclass) of BaseOperator. Tasks placed downstream of this operator will not run (i.e. they will be skipped) if the DAG run is not in the latest schedule interval (i.e. it is not the "last run"). Also quoting from the LatestOnlyOperator docstring:
"""
Allows a workflow to skip tasks that are not running during the most
recent schedule interval.
If the task is run outside of the latest schedule interval, all
directly downstream tasks will be skipped.
Note that downstream tasks are never skipped if the given DAG_Run is
marked as externally triggered.
"""
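The check described in the docstring can be sketched in plain Python as follows. This is modelled loosely on the operator source linked in the question, but it is a simplification: the function names are hypothetical, a fixed timedelta stands in for the DAG's real schedule, and "now" is passed in explicitly rather than read from the clock.

```python
from datetime import datetime, timedelta


def is_latest_interval(execution_date, interval, now):
    """True if `now` falls in the window in which this run is the latest one.

    The window opens when the run's own interval ends and closes one
    interval later, when the next run becomes the latest.
    """
    left_window = execution_date + interval
    right_window = left_window + interval
    return left_window < now <= right_window


def downstream_state(execution_date, interval, now, external_trigger=False):
    """Return "run" or "skipped" for tasks downstream of the latest-only check."""
    if external_trigger:
        # Externally triggered runs never skip their downstream tasks.
        return "run"
    return "run" if is_latest_interval(execution_date, interval, now) else "skipped"


now = datetime(2016, 1, 2, 6, 0)   # assumed current time
daily = timedelta(days=1)          # assumed daily schedule

old_run = downstream_state(datetime(2015, 12, 31), daily, now)
latest_run = downstream_state(datetime(2016, 1, 1), daily, now)
manual_run = downstream_state(datetime(2015, 12, 31), daily, now, external_trigger=True)

print(old_run, latest_run, manual_run)
```

Here the 2015-12-31 run is a catchup run, so its downstream tasks are skipped, while the 2016-01-01 run is the latest one and proceeds normally. The same old run proceeds when it is marked as externally triggered.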
Conclusion
You can define your scheduled DAG with catchup=True and use LatestOnlyOperator to make sure that some tasks will not be executed during the catchup runs. Moreover, LatestOnlyOperator can be used if you want to re-run some past DAG runs (for example by clearing them in the UI) but you have some tasks (such as sending notifications) that you want to skip during those re-runs.