Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

airflow trigger_dag execution_date is the next day, why?

Tags:

Recently I have tested airflow so much that have one problem with execution_date when running airflow trigger_dag <my-dag>.

I have learned that execution_date is not what we think at first time from here:

Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available.

start_date = datetime.combine(datetime.today(),                               datetime.min.time())  args = {     "owner": "xigua",     "start_date": start_date } dag = DAG(dag_id="hadoopprojects", default_args=args,           schedule_interval=timedelta(days=1))   wait_5m = ops.TimeDeltaSensor(task_id="wait_5m",                               dag=dag,                               delta=timedelta(minutes=5)) 

Above codes is the start part of my daily workflow, the first task is a TimeDeltaSensor that waits another 5 minutes before actual work, so this means my dag will be triggered at 2016-09-09T00:05:00, 2016-09-10T00:05:00... etc.

In Web UI, I can see something like scheduled__2016-09-20T00:00:00, and task is run at 2016-09-21T00:00:00, which seems reasonable according to ETL model.

However someday my dag is not triggered for unknown reason, so I trigger it manually, if I trigger it at 2016-09-20T00:10:00, then the TimeDeltaSensor will wait until 2016-09-21T00:15:00 before run.

This is not what I want, I want it to run at 2016-09-20T00:15:00 not the next day, I have tried passing execution_date through --conf '{"execution_date": "2016-09-20"}', but it doesn't work.

How should I deal with this issue ?

$ airflow version [2016-09-21 17:26:33,654] {__init__.py:36} INFO - Using executor LocalExecutor   ____________       _____________  ____    |__( )_________  __/__  /________      __ ____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / / ___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /  _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/    v1.7.1.3 
like image 663
Jiacai Liu Avatar asked Sep 21 '16 09:09

Jiacai Liu


People also ask

What is Airflow execution date and start date?

The execution time in Airflow is not the actual run time, but rather the start timestamp of its schedule period. For example, the execution time of the first DAG run is 2019–12–05 7:00:00, though it is executed on 2019–12–06.

How do I change the schedule interval on Airflow?

To schedule a dag, Airflow just looks for the last execution date and sum the schedule interval . If this time has expired it will run the dag. You cannot simple update the start date. A simple way to do this is edit your start date and schedule interval , rename your dag (e.g. xxxx_v2.py) and redeploy it.

What is Airflow logical date?

The “logical date” (also called execution_date in Airflow versions prior to 2.2) of a DAG run, for example, denotes the start of the data interval, not when the DAG is actually executed.


2 Answers

First, I recommend you use constants for start_date, because dynamic ones would act unpredictably based on with your airflow pipeline is evaluated by the scheduler.

More information about start_date here in an FAQ entry that I wrote and sort all this out: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date

Now, about execution_date and when it is triggered, this is a common gotcha for people onboarding on Airflow. Airflow sets execution_date based on the left bound of the schedule period it is covering, not based on when it fires (which would be the right bound of the period). When running an schedule='@hourly' task for instance, a task will fire every hour. The task that fires at 2pm will have an execution_date of 1pm because it assumes that you are processing the 1pm to 2pm time window at 2pm. Similarly, if you run a daily job, the run an with execution_date of 2016-01-01 would trigger soon after midnight on 2016-01-02.

This left-bound labelling makes a lot of sense when thinking in terms of ETL and differential loads, but gets confusing when thinking in terms of a simple, cron-like scheduler.

like image 163
mistercrunch Avatar answered Oct 15 '22 11:10

mistercrunch


Airflow will provide the time in UTC. I am not sure at what timezone you are running the tasks. So make sure you think of UTC timezone and schedule or trigger the jobs accordingly.

Try converting the time you want to trigger to UTC time and trigger the DAG. it works. For more information, you can read the below link

https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls

like image 21
vijay krishna Avatar answered Oct 15 '22 11:10

vijay krishna