Recently I have been testing Airflow heavily and have run into a problem with execution_date when running airflow trigger_dag <my-dag>.
I have learned that execution_date is not what it first appears to be, from here:
Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available.
```python
from datetime import datetime, timedelta

from airflow.models import DAG
from airflow.operators import sensors as ops

# Dynamic start_date: midnight of whatever day the scheduler parses this file.
start_date = datetime.combine(datetime.today(), datetime.min.time())
args = {"owner": "xigua", "start_date": start_date}
dag = DAG(dag_id="hadoopprojects", default_args=args,
          schedule_interval=timedelta(days=1))
# Wait a further 5 minutes into each schedule period before the real work.
wait_5m = ops.TimeDeltaSensor(task_id="wait_5m", dag=dag, delta=timedelta(minutes=5))
```
The code above is the start of my daily workflow. The first task is a TimeDeltaSensor that waits a further 5 minutes before the actual work, so my dag should effectively run at 2016-09-09T00:05:00, 2016-09-10T00:05:00, and so on.
In the Web UI I can see runs like scheduled__2016-09-20T00:00:00, and the task actually runs at 2016-09-21T00:00:00, which seems reasonable under the ETL model.
However, one day my dag was not triggered for some unknown reason, so I triggered it manually. If I trigger it at 2016-09-20T00:10:00, the TimeDeltaSensor waits until 2016-09-21T00:15:00 before running. That is not what I want: I want it to run at 2016-09-20T00:15:00, not the next day. I have tried passing execution_date through --conf '{"execution_date": "2016-09-20"}', but it doesn't work.
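That is, I ran something along these lines (using the dag_id from the code above):

```
$ airflow trigger_dag hadoopprojects --conf '{"execution_date": "2016-09-20"}'
```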
How should I deal with this issue?
```
$ airflow version
[2016-09-21 17:26:33,654] {__init__.py:36} INFO - Using executor LocalExecutor
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
   v1.7.1.3
```
The execution time in Airflow is not the actual run time, but rather the start timestamp of its schedule period. For example, the execution time of the first DAG run is 2019-12-05 7:00:00, though it is executed on 2019-12-06.
To schedule a dag, Airflow just takes the last execution date and adds the schedule interval; once that time has passed, it runs the dag. You cannot simply update the start date. A simple way to do this is to edit your start date and schedule interval, rename your dag (e.g. xxxx_v2.py), and redeploy it.
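A minimal sketch of that rule, with illustrative names (simplified; not the scheduler's actual code):

```python
from datetime import datetime, timedelta

schedule_interval = timedelta(days=1)

def next_run(last_execution_date, now):
    """Decide whether to create the next dag run (simplified model)."""
    next_execution_date = last_execution_date + schedule_interval
    # The run is labelled with the start of its period (next_execution_date)
    # but only fires once the whole period has elapsed.
    if now >= next_execution_date + schedule_interval:
        return next_execution_date
    return None

# The run labelled 2016-09-20 fires soon after midnight on 2016-09-21.
print(next_run(datetime(2016, 9, 19), datetime(2016, 9, 21, 0, 1)))
```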
The “logical date” (also called execution_date in Airflow versions prior to 2.2) of a DAG run, for example, denotes the start of the data interval, not when the DAG is actually executed.
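In Airflow 2.2+ the same idea is exposed as a data interval. A minimal sketch assuming the TaskFlow API (the dag and task names are illustrative):

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule_interval="@daily", start_date=datetime(2016, 9, 1), catchup=False)
def interval_demo():
    @task
    def report(data_interval_start=None, data_interval_end=None):
        # For the run that executes on 2016-09-21, this prints the
        # 2016-09-20 -> 2016-09-21 window being summarized;
        # data_interval_start is the old execution_date.
        print(f"processing {data_interval_start} to {data_interval_end}")

    report()

demo = interval_demo()
```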
First, I recommend you use constants for start_date, because dynamic ones behave unpredictably depending on when your airflow pipeline is evaluated by the scheduler.
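For example, replacing the dynamic start_date from the code above with a constant (the date itself is illustrative):

```python
from datetime import datetime, timedelta

from airflow.models import DAG

args = {
    "owner": "xigua",
    # A constant: every scheduler evaluation sees the same start_date.
    "start_date": datetime(2016, 9, 9),
}
dag = DAG(dag_id="hadoopprojects", default_args=args,
          schedule_interval=timedelta(days=1))
```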
More information about start_date is in an FAQ entry I wrote to sort all this out: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
Now, about execution_date and when a run is triggered: this is a common gotcha for people onboarding on Airflow. Airflow sets execution_date to the left bound of the schedule period the run is covering, not to when it fires (which would be the right bound of the period). When running a schedule='@hourly' task, for instance, a task will fire every hour. The task that fires at 2pm will have an execution_date of 1pm, because it assumes you are processing the 1pm-to-2pm window at 2pm. Similarly, if you run a daily job, the run with an execution_date of 2016-01-01 would trigger soon after midnight on 2016-01-02.
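A sketch of that labelling rule in plain datetime arithmetic (illustrative, not Airflow code):

```python
from datetime import datetime, timedelta

def fire_time(execution_date, interval):
    # A run labelled with the left bound of its period fires at the right bound.
    return execution_date + interval

print(fire_time(datetime(2016, 1, 1, 13), timedelta(hours=1)))  # 2016-01-01 14:00, the 1pm run fires at 2pm
print(fire_time(datetime(2016, 1, 1), timedelta(days=1)))       # 2016-01-02 00:00, the daily run for 2016-01-01
```

Depending on your Airflow version, airflow trigger_dag may also accept an explicit execution date directly (check airflow trigger_dag --help for an -e/--exec_date option) rather than smuggling it through --conf.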
This left-bound labelling makes a lot of sense when thinking in terms of ETL and differential loads, but gets confusing when thinking in terms of a simple, cron-like scheduler.
Airflow provides times in UTC. I am not sure in what timezone you are running your tasks, so make sure you think in UTC when you schedule or trigger jobs. Try converting the time you want to trigger to UTC and then trigger the DAG; it works. For more information, read the link below:
https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls
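For example, a sketch of the conversion using pytz (a library Airflow itself depends on; the local timezone here is an assumed example):

```python
from datetime import datetime

import pytz

local_tz = pytz.timezone("Asia/Shanghai")  # hypothetical local timezone
local_time = local_tz.localize(datetime(2016, 9, 20, 8, 10))
utc_time = local_time.astimezone(pytz.utc)
print(utc_time)  # 2016-09-20 00:10:00+00:00 -- trigger the DAG with this time
```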