Difference between latest only operator and catchup in Airflow

Tags:

python

airflow

As the title implies, I'm looking to understand the difference between catchup=False in the DAG definition and the LatestOnlyOperator.

https://airflow.apache.org/docs/stable/scheduler.html
https://airflow.apache.org/docs/stable/_modules/airflow/operators/latest_only_operator.html

asked Apr 16 '20 14:04 by dirtyw0lf



1 Answer

They are, I would say, totally different concepts, and they can be used independently. It is true that both can be used to prevent backfilling, but if that is your only concern, just use catchup=False. Quoting from this reply by one of the Airflow developers, it seems clear that this is the recommended practice:

As the author of LatestOnlyOperator, the goal was as a stopgap until catchup=False landed.

He then goes on to say that LatestOnlyOperator should be deprecated. As a user of both catchup=False and LatestOnlyOperator, I disagree, and I'll try to explain why. My understanding of the two concepts is this:


Catchup = True

In a DAG definition you can set the catchup flag to True. Note that catchup is an argument of the DAG itself (passed to the DAG() constructor), not an entry in default_args, which are forwarded to the tasks. If you set this flag to True and turn the DAG on, the scheduler will create a DAG run for each completed schedule interval from the start_date to the "present" and will execute them sequentially. Quoting the documentation:

If the dag.catchup value had been True instead, the scheduler would have created a DAG Run for each completed interval between 2015-12-01 and 2016-01-02 (but not yet one for 2016-01-02, as that interval hasn’t completed) and the scheduler will execute them sequentially.
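To make the quoted behaviour concrete, here is a minimal sketch in plain Python (no Airflow import) of which runs get created. It is a simplified model, not the scheduler's actual code: the function name scheduled_runs is hypothetical, and it assumes a fixed daily interval, the 2015-12-01 start_date from the documentation example, and a scheduler that first sees the DAG at 06:00 on 2016-01-02.

```python
from datetime import datetime, timedelta


def scheduled_runs(start_date, interval, now, catchup):
    """Return the execution dates a scheduler would create runs for.

    Simplified model (an assumption, not Airflow's implementation):
    a run for execution_date covers [execution_date, execution_date + interval)
    and is only created once that interval has completed.
    """
    runs = []
    execution_date = start_date
    while execution_date + interval <= now:
        runs.append(execution_date)
        execution_date += interval
    if catchup:
        # catchup=True: one run per completed interval since start_date.
        return runs
    # catchup=False: only the most recent completed interval gets a run.
    return runs[-1:]


start = datetime(2015, 12, 1)
now = datetime(2016, 1, 2, 6, 0)
daily = timedelta(days=1)

with_catchup = scheduled_runs(start, daily, now, catchup=True)
without_catchup = scheduled_runs(start, daily, now, catchup=False)

print(len(with_catchup), with_catchup[0], with_catchup[-1])
print(without_catchup)
```

With catchup=True the model yields one run per day from 2015-12-01 up to and including 2016-01-01, and none yet for 2016-01-02, matching the quote. With catchup=False only the 2016-01-01 run remains.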


LatestOnlyOperator

A LatestOnlyOperator is a subclass of BaseOperator. If the DAG run is not in the latest schedule interval (i.e. it is not the "last run"), the tasks placed downstream of this operator will not run: they are skipped. Also quoting from the LatestOnlyOperator docstring:

"""
Allows a workflow to skip tasks that are not running during the most
recent schedule interval.

If the task is run outside of the latest schedule interval, all
directly downstream tasks will be skipped.

Note that downstream tasks are never skipped if the given DAG_Run is
marked as externally triggered.
"""
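The check described in the docstring can be sketched in plain Python as follows. This is modeled on my reading of the operator source linked in the question, not copied from it: the function name is_latest_run is hypothetical, and a fixed timedelta stands in for the DAG's real schedule (Airflow derives the window from the DAG's schedule instead).

```python
from datetime import datetime, timedelta


def is_latest_run(execution_date, interval, now, externally_triggered=False):
    """Return True if downstream tasks should run, False if they should be skipped.

    Assumption: the schedule is a fixed timedelta. The "latest" window is the
    interval that follows the one this run covers.
    """
    if externally_triggered:
        # Externally (manually) triggered runs never skip downstream tasks.
        return True
    left_window = execution_date + interval   # end of the interval this run covers
    right_window = left_window + interval     # end of the next interval
    return left_window < now <= right_window


now = datetime(2016, 1, 2, 6, 0)
daily = timedelta(days=1)

latest = is_latest_run(datetime(2016, 1, 1), daily, now)
backfill = is_latest_run(datetime(2015, 12, 15), daily, now)
manual = is_latest_run(datetime(2015, 12, 15), daily, now, externally_triggered=True)

print(latest, backfill, manual)
```

In this model the run for 2016-01-01 counts as the latest one, the run for 2015-12-15 has its downstream tasks skipped, and the same old run is not skipped when it was triggered externally.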

Conclusion

You can define your scheduled DAG with catchup=True and use LatestOnlyOperator to make sure that some tasks are not executed during the catchup runs. LatestOnlyOperator is also useful when you want to re-run some past DAG runs (for example by clearing them in the UI) but have some tasks, such as sending notifications, that you want to skip during those re-runs.

answered Sep 20 '22 14:09 by UJIN