I don't understand why we need a 'start_date' for the operators (task instances). Shouldn't the one we pass to the DAG suffice?
Also, if the current time is 7th Feb 2018 8:30 am UTC and I set the start_date of the DAG to 7th Feb 2018 0:00 am, with my cron expression for the schedule interval being 30 9 * * * (daily at 9:30 am, i.e. expecting it to run within the next hour), will my DAG run today at 9:30 am or tomorrow (8th Feb at 9:30 am)?
Note that the start_date argument for the DAG and its tasks points to the same logical date: it marks the start of the DAG's first data interval, not when tasks in the DAG will start running. In other words, a DAG run is only scheduled one interval after start_date.
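As a concrete illustration, here's a minimal sketch of that behaviour (the dag_id and task id are placeholders I made up; Airflow 2-style imports assumed):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="first_interval_example",      # placeholder name
    start_date=datetime(2016, 1, 1),      # start of the first data interval
    schedule_interval="@daily",
) as dag:
    DummyOperator(task_id="noop")

# The first run covers 2016-01-01 -> 2016-01-02 and is only triggered
# once that interval has ended, i.e. shortly after 2016-01-02T00:00.
```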
Airflow uses XComs to pass data between operators. If the flow is operator A -> operator B, then operator A must "push" a value to XCom, and operator B must "pull" that value if it wants to read it.
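Here's a minimal sketch of that A -> B pattern (the dag_id, task ids, and the "my_value" key are placeholders; Airflow 2-style imports assumed):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def push_value(ti):
    # Operator A pushes a value to XCom under an explicit key.
    ti.xcom_push(key="my_value", value=42)

def pull_value(ti):
    # Operator B pulls the value that task A pushed.
    value = ti.xcom_pull(task_ids="task_a", key="my_value")
    print(f"pulled from XCom: {value}")

with DAG(
    dag_id="xcom_example",
    start_date=datetime(2018, 2, 7),
    schedule_interval=None,
) as dag:
    task_a = PythonOperator(task_id="task_a", python_callable=push_value)
    task_b = PythonOperator(task_id="task_b", python_callable=pull_value)
    task_a >> task_b
```

Note that a PythonOperator's return value is also pushed automatically under the return_value key, so returning a value from A and pulling it with xcom_pull(task_ids="task_a") works as well.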
Some common operators available in Airflow are:
BashOperator – executes bash commands on the machine it runs on.
PythonOperator – takes a Python function as input and calls it (the callable can also accept Airflow context variables as keyword arguments, so its signature matters).
EmailOperator – sends emails using the configured SMTP server.
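A quick sketch combining the first two (all names are placeholders, Airflow 2-style imports assumed):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def greet():
    print("hello from Python")

with DAG(
    dag_id="common_operators",
    start_date=datetime(2018, 2, 7),
    schedule_interval="@daily",
) as dag:
    bash_task = BashOperator(task_id="bash_task", bash_command="echo 'hello from bash'")
    python_task = PythonOperator(task_id="python_task", python_callable=greet)
    bash_task >> python_task
```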
From the Airflow docs: class airflow.operators.dummy.DummyOperator(**kwargs) – an operator that does literally nothing. It can be used to group tasks in a DAG. The task is evaluated by the scheduler but never processed by the executor.
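For example, a DummyOperator is often used as a join point between parallel branches (a sketch; all ids here are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="dummy_join",
    start_date=datetime(2018, 2, 7),
    schedule_interval="@daily",
) as dag:
    extract_a = BashOperator(task_id="extract_a", bash_command="echo a")
    extract_b = BashOperator(task_id="extract_b", bash_command="echo b")
    extract_done = DummyOperator(task_id="extract_done")  # runs nothing itself
    load = BashOperator(task_id="load", bash_command="echo load")

    [extract_a, extract_b] >> extract_done >> load
```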
Regarding start_date on a task instance: personally I have never used this; I always just set a single start_date on the DAG.
However, from what I can see, it would allow you to have certain tasks start at a different time from the main DAG. It appears to be a legacy feature; reading the FAQ, they recommend using time sensors for that sort of thing instead, and having one start_date for all tasks, passed through the DAG.
Your second question:
The execution date for a run is always the start of the period the run covers, i.e. one schedule interval behind the time the run actually fires.
From the Airflow scheduling docs:
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
To clarify, for your case: with start_date = 7th Feb 00:00 and schedule_interval = 30 9 * * *, the first period covered runs from 7th Feb 9:30 am to 8th Feb 9:30 am. So the first run fires tomorrow, 8th Feb at 9:30 am, stamped with execution_date 7th Feb 9:30 am; the DAG will not run today.
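Here's that exact setup as a sketch (the dag_id and task id are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="nine_thirty_daily",
    start_date=datetime(2018, 2, 7),   # 7th Feb 2018 00:00 UTC
    schedule_interval="30 9 * * *",    # daily at 9:30 am
) as dag:
    DummyOperator(task_id="noop")

# First cron point after start_date: 2018-02-07T09:30 (the execution_date).
# The run itself only fires once that period ends: 2018-02-08T09:30.
```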
Coming back to your first question: some complex requirements may need specific timings at the task level. For example, I may want my DAG to run each day for a full week before some aggregation logging task starts running; to achieve this I could set a different start_date at the task level.
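A hypothetical sketch of that: the DAG starts on 7th Feb, but the aggregation task only starts a week later (the ids and dates are illustrative, not a recommendation, given the FAQ advice above):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="weekly_aggregation",
    start_date=datetime(2018, 2, 7),
    schedule_interval="@daily",
) as dag:
    daily_work = BashOperator(task_id="daily_work", bash_command="echo daily work")
    # Task instances before this task-level start_date are never scheduled,
    # even though the rest of the DAG runs daily from 7th Feb.
    aggregation_logging = BashOperator(
        task_id="aggregation_logging",
        bash_command="echo aggregate the week",
        start_date=datetime(2018, 2, 14),
    )
    daily_work >> aggregation_logging
```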
A bit more useful info: looking through the Airflow DAG class source, it appears that setting start_date at the DAG level simply means it is passed through to a task when no default value for the task's start_date was passed in to the DAG via the default_args dict, and no specific start_date is defined at the per-task level. So for any case where you want all tasks in a DAG to kick off at the same time (dependencies aside), setting start_date at the DAG level is sufficient.
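A sketch of that precedence: default_args supplies a start_date to every task, and an explicit task-level start_date overrides it (ids and dates are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

default_args = {"start_date": datetime(2018, 2, 7)}

with DAG(
    dag_id="start_date_precedence",
    default_args=default_args,
    schedule_interval="@daily",
) as dag:
    # Inherits 2018-02-07 from default_args.
    inherits = DummyOperator(task_id="inherits_default")
    # Its own start_date wins over default_args.
    overrides = DummyOperator(
        task_id="overrides_default",
        start_date=datetime(2018, 2, 14),
    )
```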