I have a DAG scheduled to run at 10 AM every Monday. Here is my DAG definition:
from datetime import datetime
from airflow import models
from airflow.operators.latest_only_operator import LatestOnlyOperator
from airflow.operators.python_operator import PythonOperator

DAG = models.DAG(
    dag_id="etl",
    schedule_interval="0 10 * * 1",
    start_date=datetime(2018, 10, 1),
    default_args=args,
)

latest_only = LatestOnlyOperator(task_id="latest", dag=DAG)
extract = PythonOperator(task_id="extract", python_callable=extract, dag=DAG)
extract.set_upstream(latest_only)
It gets triggered at 10 AM every Monday. It ran today (05/06/2019), but the run shows a scheduled date of 2019-04-29 14:00:00.
The task instance has the following dates:
execution_date : 2019-04-29T14:00:00+00:00
start_date : 2019-05-06 14:19:48.527488+00:00
end_date : 2019-05-06 14:19:54.225001+00:00
It ran fine last Monday (4/29) with the right dates, and the DAG run history now shows 2 runs for 4/29. What could be causing this?
To decide when to run a DAG, Airflow simply takes the last execution date and adds the schedule interval; once that time has passed, it triggers the DAG. You cannot simply update the start date of an existing DAG. A simple way to do this is to edit your start date and schedule interval, rename your DAG (e.g. xxxx_v2.py), and redeploy it.
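For illustration, a redeployed copy could look like the sketch below; the file name etl_v2.py, the dag_id "etl_v2", and the new start_date are placeholders, and args is whatever default_args dict your original file already defines:

# etl_v2.py -- hypothetical redeploy with a new dag_id and start date
from datetime import datetime
from airflow import models

DAG = models.DAG(
    dag_id="etl_v2",                  # new id so the scheduler treats it as a fresh DAG
    schedule_interval="0 10 * * 1",   # still 10:00 every Monday
    start_date=datetime(2019, 5, 6),  # placeholder for the new start date you want
    default_args=args,                # same default_args dict as in the original file
)
# ...re-attach the same latest_only and extract tasks to this DAG object as before...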
dag_file_processor_timeout: The default is 50 seconds. This is the maximum amount of time a DagFileProcessor, which processes a DAG file, can run before it times out.
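If you want to check what value your deployment is actually using, you can read it back from the loaded configuration; a minimal sketch, assuming an Airflow version where this option lives under [core]:

from airflow.configuration import conf

# Effective DAG file processor timeout, in seconds (defaults to 50).
# Assumes the option lives under [core]; adjust the section if your version differs.
timeout = conf.getint("core", "dag_file_processor_timeout")
print("dag_file_processor_timeout:", timeout)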
The execution time in Airflow is not the actual run time, but rather the start timestamp of its schedule period. For example, the execution time of the first DAG run is 2019-12-05 7:00:00, though it is executed on 2019-12-06.
Data Interval: A property of each DAG run that represents the period of data that each task should operate on. For example, for a DAG scheduled hourly, each data interval begins at the top of the hour (minute 0) and ends at the close of the hour (minute 59).
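To see these values at run time, you can log them from a task; a minimal sketch, assuming Airflow 1.10.x (current when this question was asked), with a made-up dag_id:

from datetime import datetime
from airflow import models
from airflow.operators.python_operator import PythonOperator

def show_schedule_fields(**context):
    # execution_date is the START of the period the run covers, not when the task runs.
    print("execution_date:", context["execution_date"])
    print("next_execution_date:", context["next_execution_date"])  # end of the period
    print("actual wall clock:", datetime.utcnow())

demo = models.DAG(
    dag_id="schedule_fields_demo",    # illustrative name
    schedule_interval="0 10 * * 1",   # same weekly cron as in the question
    start_date=datetime(2019, 4, 1),
    catchup=False,
)

show = PythonOperator(
    task_id="show",
    python_callable=show_schedule_fields,
    provide_context=True,             # needed on 1.10.x to receive the context kwargs
    dag=demo,
)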
There's a chapter on Scheduling in the Airflow documentation, which states:
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
You are experiencing exactly this: today (2019-05-06) a DagRun is created for the latest "completed" interval, meaning the week starting on 2019-04-29.
Thinking about it like this might help: if you want to process some data periodically, you need to start processing it after the data is ready for that period.
Airflow schedules a DAG run at the end of each interval, with the execution date set to the start of that interval. So, usually, execution_date = schedule time - interval.
For example, in your DAG, the last interval was 2019-04-29T14:00:00 to 2019-05-06T14:00:00, and its run only got scheduled at 2019-05-06T14:00:00, with the execution date 2019-04-29T14:00:00. This is the normal behaviour of Airflow. It is not clear how your DAG could have run with execution date 2019-04-29T14:00:00 before 2 PM on May 6th, as you mention in your question; perhaps you changed the DAG's interval or triggered it manually.
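In other words, the stamped execution date trails the moment the run is created by exactly one interval. A tiny sketch with the dates from your task instance (times in UTC; the weekly timedelta stands in for the "0 10 * * 1" cron):

from datetime import datetime, timedelta

interval = timedelta(weeks=1)                 # one schedule_interval for a weekly cron
run_created_at = datetime(2019, 5, 6, 14, 0)  # when the scheduler created the run (UTC)
execution_date = run_created_at - interval    # start of the period the run covers

print(execution_date)  # 2019-04-29 14:00:00, matching the task instance above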