How to consider daylight savings time when using cron schedule in Airflow

Tags: airflow

In Airflow, I'd like a job to run at a specific time each day in a non-UTC timezone. How can I go about scheduling this?

The problem is that once daylight saving time starts or ends, my job will run either an hour too early or an hour too late. According to the Airflow docs, this seems to be a known issue:

In case you set a cron schedule, Airflow assumes you will always want to run at the exact same time. It will then ignore day light savings time. Thus, if you have a schedule that says run at end of interval every day at 08:00 GMT+1 it will always run end of interval 08:00 GMT+1, regardless if day light savings time is in place.

Has anyone else run into this issue? Is there a workaround? Surely the best practice cannot be to alter all the scheduled times every time daylight saving time starts or ends?

Thanks.

Scott Skiles asked Jan 28 '23 08:01



1 Answer

Starting with Airflow 1.10, time-zone aware DAGs can be defined by passing a time-zone aware datetime object as start_date. To have Airflow schedule DAG runs at the same local (wall-clock) time every day, regardless of a daylight-saving-time switch, specify schedule_interval as a cron expression. To have Airflow schedule DAG runs at fixed intervals of elapsed time, regardless of a daylight-saving-time switch, specify schedule_interval as a datetime.timedelta().

For example, consider the following code that, first, uses a cron expression to schedule two consecutive DAG runs, and then uses a fixed interval to do the same.

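from datetime import datetime, timedelta

import pendulum
from airflow import DAG

# Time-zone aware start date: 2019-10-25 08:00 local time in Europe/Kiev.
START_DATE = datetime(
    year=2019,
    month=10,
    day=25,
    hour=8,
    minute=0,
    tzinfo=pendulum.timezone('Europe/Kiev'),
)


def gen_execution_dates(start_date, schedule_interval):
    """Print the next two execution dates for the given schedule."""
    dag = DAG(
        dag_id='id', start_date=start_date, schedule_interval=schedule_interval
    )
    execution_date = dag.start_date
    for i in range(1, 3):
        # Ask the DAG for the execution date that follows the current one.
        execution_date = dag.following_schedule(execution_date)
        # Convert to the DAG's own time zone before printing.
        print(
            f'[Run {i}: Execution Date for "{schedule_interval}"]:',
            dag.timezone.convert(execution_date),
        )


# Cron expression: every day at 08:00 local time.
gen_execution_dates(START_DATE, '0 8 * * *')
# Fixed interval: every 24 hours of elapsed time.
gen_execution_dates(START_DATE, timedelta(days=1))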

Running the code produces the following output:

[Run 1: Execution Date for "0 8 * * *"]: 2019-10-26 08:00:00+03:00
[Run 2: Execution Date for "0 8 * * *"]: 2019-10-27 08:00:00+02:00
[Run 1: Execution Date for "1 day, 0:00:00"]: 2019-10-26 08:00:00+03:00
[Run 2: Execution Date for "1 day, 0:00:00"]: 2019-10-27 07:00:00+02:00

For the zone Europe/Kiev, daylight saving time in 2019 ends on 2019-10-27 at 04:00:00+03:00 (01:00 UTC), when clocks are set back to 03:00:00+02:00. That is between Run 1 and Run 2 in our example.
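To check where the switch falls without running Airflow, the following minimal sketch uses only the Python standard library (zoneinfo, Python 3.9+). It assumes the system time-zone database provides the key Europe/Kiev; the variable names are illustrative and not part of the original answer.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

KIEV = ZoneInfo("Europe/Kiev")  # assumes this key exists in the local tz database

# Two instants one minute apart in UTC, on either side of the 2019 switch.
before = datetime(2019, 10, 27, 0, 59, tzinfo=timezone.utc).astimezone(KIEV)
after = datetime(2019, 10, 27, 1, 0, tzinfo=timezone.utc).astimezone(KIEV)

print(before.isoformat())  # 2019-10-27T03:59:00+03:00
print(after.isoformat())   # 2019-10-27T03:00:00+02:00
```

The UTC offset drops from +03:00 to +02:00 between the two instants, so the local clock repeats the hour from 03:00 to 04:00.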

The first two output lines show that, for the DAG runs scheduled with a cron expression, the first and second runs are both scheduled for 08:00 local time, although at different UTC offsets: Eastern European Summer Time (EEST, +03:00) and Eastern European Time (EET, +02:00) respectively.

The last two output lines show that, for the DAG runs scheduled with a fixed interval, the first run is scheduled for 08:00 (EEST) and the second run is scheduled exactly 1 day (24 hours) later, which is 07:00 (EET) because of the daylight-saving-time switch.
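The difference between the two behaviours can also be reproduced with plain Python, independent of Airflow. This is a sketch of the underlying date arithmetic only, not of how Airflow implements scheduling; the helper names next_wall_clock and next_fixed_interval are made up for illustration, and the key Europe/Kiev is assumed to be available in the local time-zone database.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

KIEV = ZoneInfo("Europe/Kiev")  # assumes this key exists in the local tz database


def next_wall_clock(run: datetime) -> datetime:
    """Same local time on the next calendar day (cron-like behaviour)."""
    # Arithmetic on an aware datetime moves the wall clock; the UTC offset
    # is then looked up again for the resulting local date and time.
    return run + timedelta(days=1)


def next_fixed_interval(run: datetime) -> datetime:
    """Exactly 24 elapsed hours later (timedelta-like behaviour)."""
    # Do the arithmetic in UTC so that 24 real hours pass, then convert back.
    later_utc = run.astimezone(timezone.utc) + timedelta(hours=24)
    return later_utc.astimezone(run.tzinfo)


run_1 = datetime(2019, 10, 26, 8, 0, tzinfo=KIEV)
print(next_wall_clock(run_1).isoformat())      # 2019-10-27T08:00:00+02:00
print(next_fixed_interval(run_1).isoformat())  # 2019-10-27T07:00:00+02:00
```

The two results match Run 2 of the cron schedule and Run 2 of the timedelta schedule in the output above.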

[Figure: illustration of the example above; image not included]

SergiyKolesnikov answered Feb 16 '23 16:02
