
How to properly handle Daylight Savings Time in Apache Airflow?

Tags: dst, airflow

In Airflow, everything is supposed to be in UTC (which is not affected by DST).

However, we have workflows that deliver things based on time zones that are affected by DST.

An example scenario:

  • We have a job scheduled with a start date at 8:00 AM Eastern and a schedule interval of 24 hours.
  • Every day at 8 AM Eastern, the scheduler sees that 24 hours have passed since the last run, and runs the job.
  • DST starts and we lose an hour.
  • Today at 8 AM Eastern, the scheduler sees that only 23 hours have passed, because the time on the machine is UTC. It doesn't run the job until 9 AM Eastern, which is a late delivery.
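The 23-hour gap in the scenario above can be checked with a few lines of Python. This is only a minimal sketch: it assumes Python 3.9+ with the standard-library zoneinfo module and uses the zone name America/New_York for "Eastern"; the dates are the 2017 US spring-forward weekend, chosen purely for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: "Eastern" means the America/New_York zone.
eastern = ZoneInfo("America/New_York")

# 8 AM Eastern on the day before and the day of the 2017 spring-forward change.
before = datetime(2017, 3, 11, 8, 0, tzinfo=eastern)
after = datetime(2017, 3, 12, 8, 0, tzinfo=eastern)

# Convert to UTC first: subtracting two datetimes that share a tzinfo
# compares wall-clock values and would hide the lost hour.
before_utc = before.astimezone(timezone.utc)
after_utc = after.astimezone(timezone.utc)

gap_hours = (after_utc - before_utc).total_seconds() / 3600
print(before_utc.hour, after_utc.hour, gap_hours)  # 13 12 23.0
```

A scheduler that counts 24 elapsed UTC hours therefore fires one hour late in local time after the change.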

Is there a way to schedule DAGs so they run at the correct time after a time change?

jhnclvr asked Apr 27 '17 15:04



1 Answer

Off the top of my head:

If your machine is timezone-aware, set up your DAG to run at both 8 AM EST and 8 AM EDT, expressed in UTC. 8 AM EDT is 12:00 UTC and 8 AM EST is 13:00 UTC, so something like 0 12,13 * * *. Make the first task a ShortCircuitOperator, and in its callable use something like pytz to convert the current time to local time. If it falls within your required hour, return True to continue (i.e., run the rest of the DAG). Otherwise, return False. You'll have a tiny overhead of 2 extra tasks per day, but the latency should be minimal as long as your machine isn't overloaded.

Sloppy example:

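from datetime import datetime

# Airflow 1.x import path (matches the provide_context=True style used below)
from airflow.operators.python_operator import ShortCircuitOperator
from pytz import utc, timezone

# ...

def is8AM(**kwargs):
    # Current wall-clock time as a timezone-aware UTC datetime
    curtime = utc.localize(datetime.utcnow())
    # If you want to use the exec date (assumes it is a naive UTC datetime):
    # curtime = utc.localize(kwargs["ti"].execution_date)
    eastern = timezone('US/Eastern') # From docs, check your local names
    loc_dt = curtime.astimezone(eastern)
    # True lets downstream tasks run; False short-circuits (skips) them
    return loc_dt.hour == 8

start_task = ShortCircuitOperator(
    task_id='check_for_8AM',
    python_callable=is8AM,
    provide_context=True,
    dag=dag,
)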
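To sanity-check the idea without an Airflow installation, the hour test can be pulled into a plain function that takes the time as an argument. The sketch below is an assumption-laden variant, not part of the answer's DAG: is_local_hour is a hypothetical helper, and it uses the standard-library zoneinfo module (Python 3.9+) in place of pytz. It evaluates the two UTC hours that correspond to 8 AM Eastern (12:00 during EDT, 13:00 during EST) on one winter and one summer date.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def is_local_hour(now_utc, tz_name="America/New_York", hour=8):
    """Return True when the aware datetime now_utc falls in the given local hour.

    Hypothetical helper for illustration; tz_name and hour are assumptions.
    """
    return now_utc.astimezone(ZoneInfo(tz_name)).hour == hour


# Check both candidate UTC ticks on a winter (EST) and a summer (EDT) date.
for day in (datetime(2017, 1, 15), datetime(2017, 7, 15)):
    for utc_hour in (12, 13):
        tick = day.replace(hour=utc_hour, tzinfo=timezone.utc)
        print(tick.isoformat(), is_local_hour(tick))
```

On each date exactly one of the two ticks passes the check, which is what lets the short-circuit task skip the other run.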

Hope this is helpful

Edit: The run times were wrong; I subtracted the UTC offset instead of adding it. Additionally, because Airflow launches a run at the end of the schedule interval it covers, you'll probably end up wanting to schedule for 7 AM with an hourly schedule if you want the tasks to run at 8.
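The 7 AM remark can be illustrated with a small sketch. It assumes the usual Airflow behaviour that a run stamped with a given execution date starts once its schedule interval has ended, and it assumes an hourly interval; runs_at_local_hour is a hypothetical helper, and zoneinfo (Python 3.9+) stands in for pytz.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")  # assumption: "Eastern" zone name


def runs_at_local_hour(execution_date_utc, interval=timedelta(hours=1), hour=8):
    """True if the run for this execution date would start in the given local hour.

    Hypothetical helper: models the start time as execution date + interval.
    """
    start_of_run = execution_date_utc + interval  # end of the covered interval
    return start_of_run.astimezone(EASTERN).hour == hour


# Hourly schedule: the run stamped 7 AM Eastern is the one launched at 8 AM.
summer = datetime(2017, 7, 15, 11, 0, tzinfo=timezone.utc)  # 7 AM EDT
winter = datetime(2017, 1, 15, 12, 0, tzinfo=timezone.utc)  # 7 AM EST
print(runs_at_local_hour(summer), runs_at_local_hour(winter))  # True True
```

If the check is based on the execution date rather than the current time, the comparison has to account for this one-interval offset.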

apathyman answered Oct 21 '22 10:10
