I am working with Apache Airflow and I have a problem with the scheduled day and the starting day.
I want a DAG to run every day at 8:00 AM UTC. So, I did:
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 12, 7, 10, 0, 0),
    'email': ['[email protected]'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(hours=5)
}
# Never run
dag = DAG(dag_id='id', default_args=default_args, schedule_interval='0 8 * * *', catchup=True)
The day I uploaded the DAG was 2020-12-07, and I wanted it to run on 2020-12-08 at 08:00:00.
I set the start_date to 2020-12-07 at 10:00:00 to avoid running it on 2020-12-07 at 08:00:00 and to only trigger it the next day, but it didn't work.
Then I modified the start date:
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 12, 7, 7, 59, 0),
    'email': ['[email protected]'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(hours=5)
}
# Never run
dag = DAG(dag_id='etl-ca-cpke-spark_dev_databricks', default_args=default_args, schedule_interval='0 8 * * *', catchup=True)
Now the start date is 1 minute before the DAG should run, and indeed, because catchup is set to True, the DAG has been triggered for 2020-12-07 at 08:00:00, but it has not been triggered for 2020-12-08 at 08:00:00.
Why?
When creating a new DAG, you probably want to set a global start_date for your tasks. This can be done by declaring your start_date directly in the DAG() object. The first DagRun to be created will be based on the min(start_date) for all your tasks.
The start date is the date at which your DAG starts being scheduled. This date can be in the past or in the future. Think of the start date as the start of the data interval you want to process, for example 01/01/2021 00:00. In addition to the start date, you need a schedule interval.
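For illustration, here is a minimal sketch (assuming Airflow 2.x; the dag_id and task are made up) that declares start_date and schedule_interval directly on the DAG() object instead of in default_args:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Sketch only: start_date and schedule_interval are set on the DAG itself,
# so every task in the DAG inherits the same scheduling boundaries.
with DAG(
    dag_id='example_daily_dag',           # hypothetical dag_id
    start_date=datetime(2021, 1, 1),      # start of the first data interval
    schedule_interval='0 8 * * *',        # every day at 08:00 UTC
    catchup=False,
) as dag:
    BashOperator(task_id='noop', bash_command='echo run')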
The Airflow scheduler is designed to run as a persistent service in an Airflow production environment. To kick it off, all you need to do is execute the airflow scheduler command. It uses the configuration specified in airflow.cfg.
The “logical date” (called execution_date in Airflow versions prior to 2.2) of a DAG run denotes the start of the data interval, not when the DAG is actually executed.
For a scheduler, dates and times are critical components. In Airflow, there are two dates that take extra effort to digest: execution_date and start_date. Note that this start_date is not the same as the start_date you defined on the DAG. execution_date marks the start of the data interval a run covers, while the task instance's start_date is the wall-clock time at which the run actually began executing.
Suppose a DAG is scheduled daily at 02:00. When does the Airflow scheduler run the 04-09 execution? It waits until 04-10 02:00:00 (wall clock). Once the 04-09 execution has been triggered, you’d see its execution_date as 04-09T02:00:00, while its start_date would be something like 04-10T02:01:15 (this varies, since Airflow decides exactly when to trigger the task).
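To see these two dates side by side, here is a small sketch (assuming Airflow 2.2+, where logical_date and data_interval_* are exposed in the task context; the DAG and task names are made up) that prints the logical date, the data interval, and the wall-clock time the task actually started:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def show_dates(**context):
    # logical_date (execution_date before Airflow 2.2) marks the start of the data interval
    print("logical date:", context["logical_date"])
    # data_interval_start / data_interval_end describe the window this run covers
    print("interval:", context["data_interval_start"], "->", context["data_interval_end"])
    # ti.start_date is the wall-clock time this task instance actually began executing
    print("actual start:", context["ti"].start_date)

with DAG(
    dag_id='show_dates_example',          # hypothetical dag_id
    start_date=datetime(2021, 4, 1),
    schedule_interval='0 2 * * *',        # daily at 02:00, matching the example above
    catchup=False,
) as dag:
    PythonOperator(task_id='show_dates', python_callable=show_dates)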
The schedule interval you set up is interpreted against your Airflow infrastructure's clock. How do you set the Airflow schedule interval? You are probably familiar with the syntax of defining a DAG, and usually pass both start_date and schedule_interval as arguments to the DAG class.
When Airflow’s scheduler encounters a DAG, it calls one of two methods on the DAG’s timetable to know when to schedule the DAG’s next run. next_dagrun_info: the scheduler uses this to learn the timetable’s regular schedule, i.e. the “one run for every workday, triggered at the end of it” part of the custom-timetable example in the Airflow documentation.
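For reference, here is a rough sketch of what such a timetable could look like for the "every day at 08:00" case, following the custom-timetable interface documented for Airflow 2.2+ (the class name is made up, catchup handling is simplified, and a real deployment would still need to register the timetable through an Airflow plugin before using it on a DAG):

from typing import Optional
from pendulum import DateTime
from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable

class DailyAt8Timetable(Timetable):
    """Hypothetical timetable: one run per day, data interval 08:00 -> 08:00 the next day."""

    def infer_manual_data_interval(self, *, run_after: DateTime) -> DataInterval:
        # For manually triggered runs, cover the most recently completed interval.
        end = run_after.set(hour=8, minute=0, second=0, microsecond=0)
        if end > run_after:
            end = end.subtract(days=1)
        return DataInterval(start=end.subtract(days=1), end=end)

    def next_dagrun_info(
        self,
        *,
        last_automated_data_interval: Optional[DataInterval],
        restriction: TimeRestriction,
    ) -> Optional[DagRunInfo]:
        if last_automated_data_interval is not None:
            # The next interval starts where the previous one ended.
            start = last_automated_data_interval.end
        else:
            if restriction.earliest is None:
                return None  # No start_date, so there is nothing to schedule.
            # First ever run: align the DAG's start_date to the next 08:00 boundary.
            start = restriction.earliest.set(hour=8, minute=0, second=0, microsecond=0)
            if start < restriction.earliest:
                start = start.add(days=1)
        if restriction.latest is not None and start > restriction.latest:
            return None  # Past the DAG's end_date.
        # The run covers the interval [start, start + 1 day] and fires once it has ended.
        return DagRunInfo.interval(start=start, end=start.add(days=1))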
Airflow schedules tasks at the end of the interval (see the documentation reference).
Meaning that when you set:
start_date: datetime(2020, 12, 7, 8, 0, 0)
schedule_interval: '0 8 * * *'
The first run will kick in at 2020-12-08 at 08:00 (give or take, depending on resources), and this run's execution_date will be 2020-12-07 08:00.
The next run will kick in at 2020-12-09 at 08:00, and this run's execution_date will be 2020-12-08 08:00.
Since today is 2020-12-08, the next run hasn't kicked in yet because the interval hasn't ended.
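Putting it together, here is a minimal sketch of the configuration described above (assuming Airflow 2.x; the dag_id and task are made up). With this start_date, the first run is triggered on 2020-12-08 at 08:00 for the 2020-12-07 data interval, which is what the question was after:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(hours=5),
}

# First run triggers at 2020-12-08 08:00 UTC with execution_date 2020-12-07 08:00,
# i.e. at the end of the first data interval.
with DAG(
    dag_id='daily_8am_example',           # hypothetical dag_id
    default_args=default_args,
    start_date=datetime(2020, 12, 7, 8, 0, 0),
    schedule_interval='0 8 * * *',
    catchup=True,
) as dag:
    BashOperator(task_id='noop', bash_command='echo run')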