We've been converting our cron jobs over to Airflow DAGs, and I'm having difficulty figuring out exactly how DAG scheduling works in Airflow. Some DAGs need to run at a specific time of day (e.g. 7am); others need to run at a specific day and time of the month (e.g. 6am on the 15th of every month).
Generally, Airflow seems to be running daily DAGs correctly. So, schedule_interval = '0 7 * * *' with 'start_date': datetime(2017,4,7) runs every day at 7am.
However, a monthly DAG (schedule_interval = '0 6 15 * *' and 'start_date': datetime(2017,4,7)) ran on April 15 at 6am, but hasn't run since then. Other DAGs I've tried to schedule monthly similarly fail to run after the first month.
Airflow's documentation on scheduling is, IMO, muddy and answers to other SO questions have made me more confused. I'm hoping someone out there can clarify what is going wrong with my understanding and the DAGs I'm trying to schedule monthly.
Airflow's monthly scheduling is consistent with its daily scheduling, but it is confusing: a monthly DAG runs about a month later than you might expect. For example, if I schedule a DAG to run on the first of the month at midnight (e.g. 0 0 1 * *), the run with execution_date 2018-04-01 will actually run just after midnight on 2018-05-01. This is because Airflow waits for the execution period to finish before running. I think the idea is that the monthly execution of 2018-04-01 represents data for the entire period of 2018-04-01 to 2018-05-01. Daily DAGs follow the same rule, but there the delay is only one day, so it is much easier to overlook.
You'll need to restructure your schedules with this concept in mind.
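To make the offset concrete, here is a rough sketch in plain Python (no Airflow import) of the rule described above: a monthly run is triggered once its period has ended, i.e. at the next monthly boundary after its execution_date. The helper names next_monthly_boundary and expected_trigger_time are made up for illustration and are not part of Airflow's API. The sketch also assumes the schedule fires on a day of the month that exists in every month (28th or earlier).

```python
from datetime import datetime


def next_monthly_boundary(moment: datetime) -> datetime:
    """Return the same day and time in the following month.

    Hypothetical helper, not an Airflow function. Assumes the day of
    month is 28 or earlier so that it exists in every month.
    """
    year = moment.year + (moment.month // 12)  # roll over after December
    month = moment.month % 12 + 1
    return moment.replace(year=year, month=month)


def expected_trigger_time(execution_date: datetime) -> datetime:
    """Approximate wall-clock time at which a monthly run is triggered.

    Models the rule from the answer: the run labelled with
    execution_date starts only after its one-month period has ended.
    """
    return next_monthly_boundary(execution_date)


if __name__ == "__main__":
    # Schedule '0 0 1 * *': the run labelled 2018-04-01 starts about a month later.
    print(expected_trigger_time(datetime(2018, 4, 1)))      # 2018-05-01 00:00:00
    # Schedule '0 6 15 * *' from the question.
    print(expected_trigger_time(datetime(2017, 4, 15, 6)))  # 2017-05-15 06:00:00
```

So if the work has to happen on a particular calendar date, keep in mind that the run doing that work will carry an execution_date one period earlier.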