 

Airflow: Why is there a start_date for operators?

I don't understand why we need a 'start_date' for the operators (task instances). Shouldn't the one we pass to the DAG suffice?

Also, if the current time is 7 Feb 2018 8:30 am UTC and I set the start_date of the DAG to 7 Feb 2018 0:00 am, with the cron expression for the schedule interval being 30 9 * * * (daily at 9:30 am, i.e. expecting it to run within the next hour), will my DAG run today at 9:30 am or tomorrow (8 Feb at 9:30 am)?

asked Feb 07 '18 by soupybionics

People also ask

What is Start_date in Airflow DAG?

Since the start_date argument for the DAG and its tasks points to the same logical date, it marks the start of the DAG's first data interval, not when tasks in the DAG will start running. In other words, a DAG run will only be scheduled one interval after start_date.

How do you pass data between operators in Airflow?

Airflow uses XComs to pass data between operators. If the flow is operator A -> operator B, then operator A must "push" a value to XCom, and operator B must "pull" that value from A if it wants to read it.
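
For example, a minimal sketch of the push/pull pattern (Airflow 2.x import paths assumed; the dag_id, task ids, and key name are illustrative):

# A minimal sketch of passing a value from operator A to operator B via XCom.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def push_value(**context):
    # Operator A "pushes" a value into XCom under an explicit key.
    context["ti"].xcom_push(key="my_value", value=42)

def pull_value(**context):
    # Operator B "pulls" the value, naming the task that pushed it.
    value = context["ti"].xcom_pull(task_ids="push_task", key="my_value")
    print(f"pulled from XCom: {value}")

with DAG(
    dag_id="xcom_example",
    start_date=datetime(2018, 2, 7),
    schedule_interval=None,
) as dag:
    push_task = PythonOperator(task_id="push_task", python_callable=push_value)
    pull_task = PythonOperator(task_id="pull_task", python_callable=pull_value)

    push_task >> pull_task  # flow is A -> B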

What are the operators used in Airflow?

Some common operators available in Airflow are:

  • BashOperator – used to execute bash commands on the machine it runs on.
  • PythonOperator – takes any python function as an input and calls it (this means the function should have a specific signature as well).
  • EmailOperator – sends emails using a configured SMTP server.
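
A short sketch of the first two side by side (Airflow 2.x import paths assumed; ids, command, and callable are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def greet():
    print("hello from a python callable")

with DAG(
    dag_id="common_operators_example",
    start_date=datetime(2018, 2, 7),
    schedule_interval="@daily",
) as dag:
    # Runs a shell command on the machine the task lands on.
    bash_task = BashOperator(task_id="bash_task", bash_command="echo hello")

    # Calls the given python function when the task runs.
    python_task = PythonOperator(task_id="python_task", python_callable=greet)

    bash_task >> python_task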

What is dummy operator in Airflow?

airflow.operators.dummy.DummyOperator(**kwargs) is an operator that does literally nothing. It can be used to group tasks in a DAG. The task is evaluated by the scheduler but never processed by the executor.
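
A minimal sketch using DummyOperator as a join point to group tasks (ids are illustrative; Airflow 2.x import paths assumed):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="dummy_operator_example",
    start_date=datetime(2018, 2, 7),
    schedule_interval=None,
) as dag:
    extract_a = BashOperator(task_id="extract_a", bash_command="echo a")
    extract_b = BashOperator(task_id="extract_b", bash_command="echo b")

    # Does nothing itself; it exists so downstream tasks can depend on
    # "all extracts finished" as a single node.
    extracts_done = DummyOperator(task_id="extracts_done")

    load = BashOperator(task_id="load", bash_command="echo load")

    [extract_a, extract_b] >> extracts_done >> load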


2 Answers

Regarding start_date on a task instance: personally, I have never used this; I always just have a single DAG start_date.

However, from what I can see, this would allow you to specify that certain tasks start at a different time from the main DAG. It appears to be a legacy feature; reading the FAQ, they recommend using time sensors for that type of thing instead, and having one start_date for all tasks, passed through the DAG.

Your second question:

The execution date for a run is always the previous period based on your schedule.

From the docs (Airflow Docs)

Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.

To clarify:

  • On a daily schedule, the run that triggers on the 8th has an execution date of the 7th.
  • On a weekly schedule set to run on Sundays, the run that triggers this Sunday has an execution date of last Sunday.
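
Applied to the question's scenario (the dag_id below is illustrative): with start_date 7 Feb 2018 00:00 and schedule 30 9 * * *, the first interval covered runs from 7 Feb 09:30 to 8 Feb 09:30, so the first run triggers tomorrow, 8 Feb at 9:30 am, stamped with an execution date of 7 Feb 09:30.

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="start_date_example",
    start_date=datetime(2018, 2, 7),   # 7 Feb 2018 00:00
    schedule_interval="30 9 * * *",    # daily at 09:30
) as dag:
    # First run triggers 8 Feb 09:30 with execution_date 7 Feb 09:30,
    # because a run fires once the period it covers has ended.
    noop = DummyOperator(task_id="noop")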
answered Sep 28 '22 by Blakey


Some complex requirements may need specific timings at the task level. For example, I may want my DAG to run each day for a full week before some aggregation logging task starts running; to achieve this, I could set a different start_date at the task level, as sketched below.
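
A minimal sketch of that idea, assuming Airflow 2.x (ids and dates are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="task_level_start_date",
    start_date=datetime(2018, 2, 7),
    schedule_interval="@daily",
) as dag:
    # Scheduled from 7 Feb onwards, like the DAG itself.
    daily_task = BashOperator(task_id="daily_task", bash_command="echo daily")

    # Task-level start_date overrides the DAG-level one for this task only,
    # so runs before 14 Feb won't schedule it.
    aggregation_logging = BashOperator(
        task_id="aggregation_logging",
        bash_command="echo aggregate",
        start_date=datetime(2018, 2, 14),
    )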

A bit more useful info: looking through the Airflow DAG class source, it appears that setting start_date at the DAG level simply means it is passed through to a task when no default start_date was passed to the DAG via the default_args dict and no specific start_date is defined at the per-task level. So for any case where you want all tasks in a DAG to kick off at the same time (dependencies aside), setting start_date at the DAG level is sufficient.
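
A sketch of that precedence (ids and dates are illustrative):

# start_date resolution, highest precedence first:
#   1. start_date set on the task itself
#   2. start_date in the DAG's default_args dict
#   3. start_date on the DAG object
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "start_date": datetime(2018, 2, 7),  # applied to every task below
}

with DAG(
    dag_id="default_args_example",
    default_args=default_args,
    schedule_interval="@daily",
) as dag:
    # No explicit start_date: inherits 2018-02-07 from default_args.
    task_a = BashOperator(task_id="task_a", bash_command="echo a")

    # Explicit start_date wins over default_args for this task only.
    task_b = BashOperator(
        task_id="task_b",
        bash_command="echo b",
        start_date=datetime(2018, 2, 14),
    )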

answered Sep 28 '22 by Jinglesting