
What is the difference between min_file_process_interval and dag_dir_list_interval in Apache Airflow 1.9.0?

Tags:

airflow

We are using Airflow v1.9.0. We have 100+ DAGs and the instance is really slow: the scheduler launches only some of the tasks.

To reduce CPU usage, we want to tweak two configuration parameters: min_file_process_interval and dag_dir_list_interval. The documentation does not make the difference between the two clear.

MassyB asked Jul 27 '18 12:07



1 Answer

min_file_process_interval:

In cases where there are only a small number of DAG definition files, the loop could potentially process the DAG definition files many times a minute. To control the rate of DAG file processing, the min_file_process_interval can be set to a higher value. This parameter ensures that a DAG definition file is not processed more often than once every min_file_process_interval seconds.
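As a rough illustration of that throttling rule, here is a minimal Python sketch. It is not Airflow's scheduler code: the function name `is_due` and the `last_processed` mapping are hypothetical, introduced only for this example.

```python
def is_due(path, now, last_processed, min_file_process_interval):
    """Return True if the DAG file at `path` may be processed again.

    `now` and the values in `last_processed` are timestamps in seconds.
    Hypothetical helper for illustration; not part of Airflow.
    """
    last = last_processed.get(path)
    if last is None:
        # The file has never been processed, so it is due immediately.
        return True
    # Otherwise wait until at least min_file_process_interval seconds have passed.
    return now - last >= min_file_process_interval


# Example: dags/etl.py was last processed at t=100 s, interval is 30 s.
last_processed = {"dags/etl.py": 100.0}
print(is_due("dags/etl.py", 110.0, last_processed, 30))  # False: only 10 s elapsed
print(is_due("dags/etl.py", 130.0, last_processed, 30))  # True: 30 s elapsed
print(is_due("dags/new.py", 110.0, last_processed, 30))  # True: never processed
```

With an interval of 0 every file is due on every loop, which is the situation the paragraph above describes for small DAG directories.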

dag_dir_list_interval:

Since the scheduler can run indefinitely, it's necessary to periodically refresh the list of files in the DAG definition directory. The refresh interval is controlled with the dag_dir_list_interval configuration parameter.
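The directory refresh can be sketched in the same spirit. The class below is a hypothetical helper, not Airflow code: it caches the list of `.py` files and re-scans the directory only once `dag_dir_list_interval` seconds have passed, so a newly added DAG file is not noticed until the next refresh.

```python
import os


class DagFileLister:
    """Caches the .py files under a directory and refreshes the list on an interval.

    Hypothetical helper for illustration; not part of Airflow.
    """

    def __init__(self, dag_dir, dag_dir_list_interval):
        self.dag_dir = dag_dir
        self.interval = dag_dir_list_interval
        self._files = []
        self._last_listed = None  # timestamp (seconds) of the last directory scan

    def files(self, now):
        """Return the cached file list, re-scanning if the interval has elapsed."""
        if self._last_listed is None or now - self._last_listed >= self.interval:
            self._files = sorted(
                os.path.join(root, name)
                for root, _dirs, names in os.walk(self.dag_dir)
                for name in names
                if name.endswith(".py")
            )
            self._last_listed = now
        return self._files
```

In short, dag_dir_list_interval governs how quickly new or deleted files are discovered, while min_file_process_interval governs how often an already known file is re-parsed. Both options are set in the [scheduler] section of airflow.cfg.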

Source: a Google search for both terms led to this first result: https://cwiki.apache.org/confluence/display/AIRFLOW/Scheduler+Basics

tobi6 answered Sep 22 '22 04:09