airflow - how to 'Filling up the DagBag' once only

My DAG takes about 50 seconds to parse, and I only use external triggers to start DAG runs, no schedules. I notice Airflow wants to fill the DagBag a lot: on every trigger_dag command, and in the background it keeps scanning the dags folder and creating .pyc files almost instantly once a new .py file is deployed.

Is there any way I can deploy my cluster and have the DagBag filled once, then for the next 2 weeks have DAG runs start instantly on any trigger_dag? Right now it takes 50 seconds just to fill the DagBag before the run starts. I have no need to update DAG definitions within those 2 weeks.
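
To make the symptom concrete (this is a sketch, not from the original post), the parse time can be measured directly by timing a DagBag fill against your dags folder, which is essentially what the scheduler and trigger_dag keep doing:

```python
# Sketch: measure how long a single DagBag fill takes.
# Assumes Airflow is installed and dags_folder is configured as usual.
import time

from airflow.models import DagBag

start = time.time()
dagbag = DagBag()  # parses every .py file in the configured dags_folder
print(f"Filled DagBag with {len(dagbag.dags)} DAGs in {time.time() - start:.1f}s")
```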

asked Apr 25 '19 by tooptoop4

People also ask

Is start_date mandatory in an Airflow DAG?

When creating a new DAG, you probably want to set a global start_date for your tasks. This can be done by declaring your start_date directly in the DAG() object. The first DagRun to be created will be based on the min(start_date) for all your tasks.
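
A minimal sketch of that (the dag_id and date are illustrative):

```python
from datetime import datetime

from airflow import DAG

# start_date declared once on the DAG object applies to every task
# that does not override it
dag = DAG(
    dag_id="example_dag",             # illustrative name
    start_date=datetime(2019, 4, 1),
    schedule_interval=None,           # no schedule, externally triggered only
)
```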

How often does Airflow check for new DAGs?

Airflow scans the dags_folder for new DAGs every dag_dir_list_interval, which defaults to 5 minutes but can be modified. You might have to wait until this interval has passed before a new DAG appears in the UI.
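
As a sketch, the interval in effect can be read from the loaded configuration (dag_dir_list_interval lives in the [scheduler] section of airflow.cfg):

```python
# Sketch: check how often the scheduler lists the dags folder for new files.
from airflow.configuration import conf

interval = conf.getint("scheduler", "dag_dir_list_interval")
print(f"New DAG files are picked up roughly every {interval} seconds")
```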

What is depends_on_past in Airflow?

According to the official Airflow docs, The task instances directly upstream from the task need to be in a success state. Also, if you have set depends_on_past=True, the previous task instance needs to have succeeded (except if it is the first run for that task).
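
A minimal sketch of a task that uses it (names are illustrative; in Airflow 2.x the operator import path is airflow.operators.python):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="depends_on_past_example",   # illustrative
    start_date=datetime(2019, 4, 1),
    schedule_interval="@daily",
)


def load_data():
    print("loading")


# This task instance will not run until the same task succeeded in the
# previous DAG run (except for the very first run of the task).
load = PythonOperator(
    task_id="load",
    python_callable=load_data,
    depends_on_past=True,
    dag=dag,
)
```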

What is max_active_runs in Airflow?

max_active_runs is the maximum number of DAG runs that may be active at the same time for a given DAG. A related setting is concurrency: the maximum number of task instances allowed to run concurrently across all active DAG runs of a given DAG. This lets you allow one DAG to run 32 tasks at once, while another DAG might only be able to run 16 tasks at once.
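
A sketch showing both knobs on the DAG object (the values are illustrative; in newer Airflow versions concurrency has been renamed max_active_tasks):

```python
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="throttled_dag",           # illustrative
    start_date=datetime(2019, 4, 1),
    schedule_interval="@daily",
    max_active_runs=1,                # at most one active DAG run at a time
    concurrency=16,                   # at most 16 task instances across all active runs
)
```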


1 Answer

50 seconds is an incredibly long time for DAG instantiation. It looks like you have a large (or simply long-running) piece of code in your DAG file. That is very bad practice:

Note: This means all top level code (ie. anything that isn't defining the DAG) in a DAG file will get run each scheduler heartbeat. Try to avoid top level code to your DAG file unless absolutely necessary.

Airflow works exactly as you describe. That is why you should treat the Python files in your DAG folder mostly as configuration files (with some programmatic capabilities). You can't change this with a magic config key or anything like it; this behaviour is at the core of Airflow.
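
A minimal sketch of what that means in practice (the helper names are hypothetical): keep expensive work out of module scope so that filling the DagBag only has to build the DAG structure, and the heavy lifting runs only when the task itself executes.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # airflow.operators.python in 2.x


def expensive_database_query():
    # stand-in for whatever slow work currently sits at module level (hypothetical)
    return []


# BAD: doing this at module level would run on every DagBag fill / scheduler parse:
# rows = expensive_database_query()


def run_expensive_work():
    # GOOD: runs only when the task executes, not when the file is parsed
    rows = expensive_database_query()
    print(f"processed {len(rows)} rows")


dag = DAG(
    dag_id="cheap_to_parse",          # illustrative
    start_date=datetime(2019, 4, 1),
    schedule_interval=None,           # externally triggered, as in the question
)

work = PythonOperator(
    task_id="work",
    python_callable=run_expensive_work,
    dag=dag,
)
```

With the heavy code moved inside the callable, the 50-second fill should drop to roughly the time it takes Python to import the file.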

answered Sep 21 '22 by vurmux