
Airflow: how to schedule a dag to start the day following a weekday?

How can I schedule a dag to have a weekday execution date but have a start date the following day, which is not necessarily a weekday?

My rationale is that I get data at the end of each business day, which I would like to process early the next morning. The Airflow "Common Pitfalls" documentation describes the execution date as the date the data belongs to, while the start date is the date you run your ETL.


For example: I want a series of dag runs to have the following execution and start dates -

DAG start_date      Task Started          Task execution_date
2018-01-01          2018-01-02 Tues       2018-01-01 Mon
                    2018-01-03 Wed        2018-01-02 Tues
                    2018-01-04 Thur       2018-01-03 Wed
                    2018-01-05 Fri        2018-01-04 Thur
                    2018-01-06 Sat        2018-01-05 Fri
                    2018-01-09 Tues       2018-01-08 Mon

The closest I have managed to get is the schedule 0 2 * * TUE-SAT, which gives the wrong execution date (Saturday) for the run started on a Tuesday (see below):

DAG start_date      Task Started          Task execution_date
2018-01-01          2018-01-03 Wed        2018-01-02 Tues
                    2018-01-04 Thur       2018-01-03 Wed
                    2018-01-05 Fri        2018-01-04 Thur
                    2018-01-06 Sat        2018-01-05 Fri
                    2018-01-09 Tues       2018-01-06 Sat

or the schedule 0 2 * * MON-FRI, which does not run Friday's DAG until Monday, and I need the results over the weekend:

DAG start_date      Task Started          Task execution_date
2018-01-01          2018-01-02 Tues       2018-01-01 Mon
                    2018-01-03 Wed        2018-01-02 Tues
                    2018-01-04 Thur       2018-01-03 Wed
                    2018-01-05 Fri        2018-01-04 Thur
                    2018-01-08 Mon        2018-01-05 Fri
                    2018-01-09 Tues       2018-01-08 Mon
RoachLord asked Jan 28 '23 08:01



1 Answer

First, quoting the Airflow docs:

Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.

Let’s Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.

So what's happening here?

  1. Cron specifies periods

Specifying 0 2 * * MON-FRI means that your periods are:

MON 2AM -> TUE 2AM
TUE 2AM -> WED 2AM
WED 2AM -> THU 2AM
THU 2AM -> FRI 2AM
FRI 2AM -> MON 2AM <- the problem

  2. Airflow sets the execution date to the beginning of the period, and waits for the end of it.

This means that the time you want each run to start defines the end of a period, while the data partition you want (the execution date) is the start of that period.
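To make the period logic concrete, here is a minimal sketch in plain Python (no Airflow required) that pairs each period start with the next fire time, the way the execution date and the actual start are described above. The helper name weekday_periods and its arguments are hypothetical, and it only approximates a cron expression with a set of weekdays and a fixed time of day.

```python
from datetime import datetime, timedelta


def weekday_periods(start, run_weekdays, count):
    """Return `count` (period_start, period_end) pairs for a schedule that
    fires at `start`'s time of day on the given weekdays (Monday=0 ... Sunday=6).

    period_start plays the role of the execution date; period_end is when
    the run would actually be started.
    """
    fire_times = []
    current = start
    # Collect one more fire time than periods, since each period needs an end.
    while len(fire_times) < count + 1:
        if current.weekday() in run_weekdays:
            fire_times.append(current)
        current += timedelta(days=1)
    # Each period runs from one fire time to the next one.
    return list(zip(fire_times[:-1], fire_times[1:]))


# Approximates 0 2 * * MON-FRI, starting Monday 2018-01-01 at 02:00.
mon_fri = weekday_periods(datetime(2018, 1, 1, 2), {0, 1, 2, 3, 4}, 5)
for execution_date, started in mon_fri:
    print(f"execution_date {execution_date:%Y-%m-%d} -> started {started:%Y-%m-%d}")
```

The last pair starts on Friday 2018-01-05 and ends on Monday 2018-01-08, which is the problematic period: the run for Friday's data cannot start before Monday.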

Long story short: it's impossible to specify a periodic division of the week such that every period starts on a weekday and ends the following day. Why? Because no period is left to cover the weekend.

How can you make a periodical division that works?

  • Simply schedule it daily at 2AM, and put a conditional task at the beginning of your DAG that skips the run if the execution date falls on a weekend.
  • Use 0 2 * * TUE-SAT, but don't treat the execution_date as the start of the data to be processed next; treat it as the point up to which past data is considered already processed.
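Both options can be sketched with two small plain-Python helpers. The names is_weekday and data_date are hypothetical. In Airflow, a function like is_weekday could plausibly serve as the callable of a ShortCircuitOperator placed first in the DAG, and the period_end argument would correspond to the run's next_execution_date; that wiring is an assumption and is not shown here.

```python
from datetime import date, datetime, timedelta


def is_weekday(execution_date):
    """Option 1: return True when the execution date is Monday through Friday.

    With a daily 2AM schedule, returning False for Saturday and Sunday
    execution dates is what would let a conditional task skip those runs.
    """
    return execution_date.weekday() < 5


def data_date(period_end):
    """Option 2: with TUE-SAT runs at 2AM, derive the business day to process
    from the end of the period (when the run starts), not from its start.
    """
    return (period_end - timedelta(days=1)).date()


# Option 1: Friday's data is processed, Saturday's execution date is skipped.
print(is_weekday(datetime(2018, 1, 5, 2)), is_weekday(datetime(2018, 1, 6, 2)))

# Option 2: the run started on Tuesday 2018-01-09 processes Monday's data,
# even though its execution date is Saturday 2018-01-06.
print(data_date(datetime(2018, 1, 9, 2)))
```

The first option keeps the execution date equal to the data date at the cost of two skipped runs per week; the second avoids empty runs but requires every task to compute its data date from the period end.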
villasv answered Jan 31 '23 08:01
