
Airflow DAG Scheduled date is a week behind

Tags: python, airflow

I have a DAG scheduled to run at 10 AM every Monday. Here is my DAG definition:

from datetime import datetime

from airflow import models
from airflow.operators.latest_only_operator import LatestOnlyOperator
from airflow.operators.python_operator import PythonOperator

# `args` and the `extract` callable are defined elsewhere in the file.
DAG = models.DAG(
    dag_id="etl",
    schedule_interval="0 10 * * 1",
    start_date=datetime(2018, 10, 1),
    default_args=args,
)

latest_only = LatestOnlyOperator(task_id="latest", dag=DAG)

extract = PythonOperator(
    task_id="extract", python_callable=extract, dag=DAG)

extract.set_upstream(latest_only)

It gets triggered at 10 AM every Monday. It ran today (05/06/2019), but its scheduled date shows as 2019-04-29 14:00:00. The task instance has the following dates:

execution_date : 2019-04-29T14:00:00+00:00
start_date : 2019-05-06 14:19:48.527488+00:00
end_date : 2019-05-06 14:19:54.225001+00:00

It ran fine last Monday (4/29) with the right dates, and the DAG history now shows 2 runs on 4/29. What could be causing this?

asked May 06 '19 by Satish

People also ask

How do I change the DAG schedule in Airflow?

To schedule a DAG, Airflow looks at the last execution date and adds the schedule interval; once that time has passed, it runs the DAG. You cannot simply update the start date. A simple way to change the schedule is to edit your start date and schedule interval, rename your DAG (e.g. xxxx_v2.py), and redeploy it, as sketched below.
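As a rough sketch of that rename-and-redeploy approach (the dag_id, dates, and schedule below are made up for the example; Airflow 1.x-style API assumed):

# Hypothetical etl_v2.py: a copy of the original DAG with a new dag_id,
# start_date and schedule_interval, redeployed so the scheduler treats it
# as a fresh DAG rather than reusing the old run history.
from datetime import datetime

from airflow import models

dag = models.DAG(
    dag_id="etl_v2",                  # new id, e.g. matching the renamed file xxxx_v2.py
    schedule_interval="0 12 * * 1",   # the new schedule you want
    start_date=datetime(2019, 5, 1),  # the new start date
    catchup=False,                    # optional: don't backfill past intervals
)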

How long can a DagFileProcessor, which processes a DAG file, run before timing out?

dag_file_processor_timeout: The default is 50 seconds. This is the maximum amount of time a DagFileProcessor, which processes a DAG file, can run before it times out.

How is execution date calculated in Airflow?

The execution time in Airflow is not the actual run time, but rather the start timestamp of its schedule period. For example, the execution time of the first DAG run is 2019-12-05 7:00:00, though it is executed on 2019-12-06.
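A minimal sketch of that arithmetic for the daily example above (plain datetime, no Airflow needed):

from datetime import datetime, timedelta

schedule_interval = timedelta(days=1)          # a daily DAG
execution_date = datetime(2019, 12, 5, 7, 0)   # start of the schedule period
actual_run_time = execution_date + schedule_interval

print(execution_date)   # 2019-12-05 07:00:00 -- the date stamped on the run
print(actual_run_time)  # 2019-12-06 07:00:00 -- when the run is actually triggered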

What is schedule interval in DAG?

Data Interval: A property of each DAG run that represents the period of data that each task should operate on. For example, for a DAG scheduled hourly, each data interval begins at the top of the hour (minute 0) and ends at the close of the hour (minute 59).
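For instance, one hourly data interval could be written out like this (plain datetime, values picked arbitrarily for the example):

from datetime import datetime, timedelta

data_interval_start = datetime(2019, 5, 6, 14, 0)             # top of the hour (minute 0)
data_interval_end = data_interval_start + timedelta(hours=1)  # close of the hour

# The run operates on data from 14:00:00 up to (but not including) 15:00:00.
print(data_interval_start, "->", data_interval_end)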


2 Answers

There's a chapter on Scheduling in the Airflow documentation, which states:

Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.

Let’s Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.

You are experiencing exactly this: today (2019-05-06) a DagRun is created for the latest "completed" interval, meaning the week starting on 2019-04-29.

Thinking about it like this might help: if you want to process some data periodically, you need to start processing it after the data is ready for that period.
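Applied to the weekly DAG in the question, the same rule can be checked with a quick back-of-the-envelope calculation (plain datetime; 14:00 UTC is presumably the 10 AM local schedule expressed in UTC):

from datetime import datetime, timedelta

schedule_interval = timedelta(weeks=1)        # "0 10 * * 1" fires once per week
run_created_at = datetime(2019, 5, 6, 14, 0)  # when the scheduler triggered the run (UTC)

# The run created on 2019-05-06 covers the interval that has just ended,
# so it is stamped with that interval's start: one week earlier.
execution_date = run_created_at - schedule_interval
print(execution_date)  # 2019-04-29 14:00:00, matching the dates in the question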

answered Oct 19 '22 by bosnjak


Airflow schedules a DAG at the end of each interval, with the execution time set to the start of that interval. So usually execution_time = schedule_time - interval.

For example, in your DAG the last interval was 2019-04-29T14:00:00 to 2019-05-06T14:00:00, and its run only got scheduled at 2019-05-06T14:00:00, with the execution time 2019-04-29T14:00:00. This is the normal behaviour of Airflow. It's not clear how your DAG ran with execution time 2019-04-29T14:00:00 before May 6th 2 PM, as you mentioned in your question. Maybe you changed the DAG's interval or made a manual trigger.

answered Oct 19 '22 by Mohammed Sherif KK