Why is a dynamic start_date discouraged in Airflow?

Tags:

airflow

I've read the Airflow FAQ entry "What's the deal with start_date?", but it still isn't clear to me why using a dynamic start_date is discouraged.

To my understanding, a DAG's execution_date is determined by the minimum start_date among all of the DAG's tasks, and subsequent DAG Runs are run at the latest execution_date + schedule_interval.

If I set the start_date in my DAG's default_args to, say, yesterday at 20:00:00, with a schedule_interval of 1 day, how would that break or confuse the scheduler, if at all? If I understand correctly, the scheduler would trigger the DAG with an execution_date of yesterday at 20:00:00, and the next DAG Run would be scheduled for today at 20:00:00.

Is there some concept that I'm missing?

astronomotrous asked Dec 14 '16 03:12


1 Answer

The first run is triggered at start_date + schedule_interval. Airflow does not run the DAG at start_date itself; a run is triggered only once the interval that begins at start_date has ended.
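
To make the arithmetic concrete, here is a minimal sketch in plain Python (no Airflow import). The helper name first_trigger_time and the example date are made up for illustration, and the function is a simplified model of the rule above, not Airflow's actual scheduler code.

```python
from datetime import datetime, timedelta


def first_trigger_time(start_date, schedule_interval):
    """Simplified model: the first run fires once the first interval has ended."""
    return start_date + schedule_interval


# Hypothetical fixed start_date with a daily schedule.
start_date = datetime(2016, 12, 13, 20, 0, 0)
schedule_interval = timedelta(days=1)

first_run = first_trigger_time(start_date, schedule_interval)

# The run carries execution_date == start_date, but it is triggered one interval later.
print("execution_date of first run:", start_date)
print("first run is triggered at:  ", first_run)
```

With a fixed start_date the trigger time is a fixed point in time that the clock eventually reaches.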

As the documentation mentions, if you set start_date dynamically, e.g. to datetime.now(), with a schedule_interval of, say, 1 hour, the run will never execute: now() moves along with time, so the trigger time datetime.now() + 1 hour is always one hour in the future and is never reached.
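
The moving-target effect can be simulated without Airflow. The sketch below assumes a simplified model in which the DAG file is re-evaluated every 10 minutes and a run is due when now >= start_date + schedule_interval; the names is_run_due and simulate are hypothetical, and the real scheduler is more involved.

```python
from datetime import datetime, timedelta

schedule_interval = timedelta(hours=1)


def is_run_due(start_date, now):
    """Simplified rule: a run is due once the first interval has fully elapsed."""
    return now >= start_date + schedule_interval


def simulate(get_start_date, parse_times):
    """Re-evaluate start_date at every parse, as happens when the DAG file is reloaded."""
    return [is_run_due(get_start_date(now), now) for now in parse_times]


# Two hours of simulated wall-clock time, one parse every 10 minutes.
base = datetime(2016, 12, 14, 3, 0, 0)
parse_times = [base + timedelta(minutes=10 * i) for i in range(13)]

# Dynamic: start_date is "now" at each parse, like start_date=datetime.now().
dynamic_results = simulate(lambda now: now, parse_times)

# Static: start_date is a fixed point in time.
static_results = simulate(lambda now: base, parse_times)

print("dynamic start_date ever due:", any(dynamic_results))
print("static start_date ever due: ", any(static_results))
```

In this model the dynamic case is never due, because the target moves forward by exactly as much as the clock does, while the static case becomes due one hour after the fixed start_date.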

liferacer answered Sep 18 '22 15:09