
Running airflow tasks/dags in parallel


I'm using airflow to orchestrate some python scripts. I have a "main" dag from which several subdags are run. My main dag is supposed to run according to the following overview:

[Overview diagram of the main DAG]

I've managed to get to this structure in my main dag by using the following lines:

    etl_internal_sub_dag1 >> etl_internal_sub_dag2 >> etl_internal_sub_dag3

    etl_internal_sub_dag3 >> etl_adzuna_sub_dag
    etl_internal_sub_dag3 >> etl_adwords_sub_dag
    etl_internal_sub_dag3 >> etl_facebook_sub_dag
    etl_internal_sub_dag3 >> etl_pagespeed_sub_dag

    etl_adzuna_sub_dag >> etl_combine_sub_dag
    etl_adwords_sub_dag >> etl_combine_sub_dag
    etl_facebook_sub_dag >> etl_combine_sub_dag
    etl_pagespeed_sub_dag >> etl_combine_sub_dag
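
For context: each of those names is a task in the main DAG, presumably created with SubDagOperator. A minimal sketch of what that might look like, assuming a hypothetical build_etl_sub_dag() factory that returns each sub-DAG (the factory and its arguments are not shown in the question), is:

    from airflow.operators.subdag_operator import SubDagOperator  # Airflow 1.x import path

    # Hypothetical sketch: build_etl_sub_dag() stands in for whatever factory returns
    # each sub-DAG; its dag_id must follow the '<parent_dag_id>.<task_id>' convention.
    etl_internal_sub_dag1 = SubDagOperator(
        task_id='etl_internal_sub_dag1',
        subdag=build_etl_sub_dag(DAG_NAME, 'etl_internal_sub_dag1', default_args),
        dag=dag,
    )
    # ...the remaining sub-DAG tasks are created the same way...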

What I want airflow to do is to first run the etl_internal_sub_dag1 then the etl_internal_sub_dag2 and then the etl_internal_sub_dag3. When the etl_internal_sub_dag3 is finished I want etl_adzuna_sub_dag, etl_adwords_sub_dag, etl_facebook_sub_dag, and etl_pagespeed_sub_dag to run in parallel. Finally, when these last four scripts are finished, I want the etl_combine_sub_dag to run.

However, when I run the main dag, etl_adzuna_sub_dag, etl_adwords_sub_dag, etl_facebook_sub_dag, and etl_pagespeed_sub_dag are run one by one and not in parallel.

Question: How do I make sure that the scripts etl_adzuna_sub_dag, etl_adwords_sub_dag, etl_facebook_sub_dag, and etl_pagespeed_sub_dag are run in parallel?

Edit: My default_args and DAG look like this:

    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': start_date,
        'end_date': end_date,
        'email': ['[email protected]'],
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 0,
        'retry_delay': timedelta(minutes=5),
    }

    DAG_NAME = 'main_dag'

    dag = DAG(DAG_NAME, default_args=default_args, catchup=False)
Mr. President asked Oct 10 '18 13:10


People also ask

Does Airflow run tasks in parallel?

Yes. Airflow can run the tasks within a DAG in parallel. Most of the time you don't need to run similar tasks one after the other, so running them in parallel is a huge time saver.

How many tasks can run in parallel in Airflow?

Parallelism: This is the maximum number of tasks that can run at the same time in a single Airflow environment. If this setting is set to 32, for example, no more than 32 tasks can run concurrently across all DAGs.

How many DAGs can Airflow run at once?

concurrency: This is the maximum number of task instances allowed to run concurrently across all active runs of a given DAG. It lets you allow one DAG to run 32 tasks at once while another DAG is limited to 16 tasks at once.
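
To illustrate (a sketch only, using the Airflow 1.x argument names; the dag_id and numbers are placeholders), these per-DAG limits can also be passed when instantiating a DAG:

    from airflow import DAG

    # Sketch: per-DAG caps that sit on top of the global `parallelism` setting.
    # `default_args` is assumed to be defined as in the question above.
    dag = DAG(
        'example_dag',
        default_args=default_args,
        concurrency=32,      # max task instances across all active runs of this DAG
        max_active_runs=1,   # max simultaneous runs of this DAG
        catchup=False,
    )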

Can a DAG trigger another DAG in Airflow?

The TriggerDagRunOperator is an easy way to implement cross-DAG dependencies. This operator allows you to have a task in one DAG that triggers another DAG in the same Airflow environment.
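
A minimal hedged sketch of that pattern (the import path shown is the Airflow 1.10 one, and the dag_id is a made-up example):

    from airflow.operators.dagrun_operator import TriggerDagRunOperator  # Airflow 1.10 path

    # A task in the current DAG (`dag`) that kicks off another DAG by its dag_id.
    trigger_downstream = TriggerDagRunOperator(
        task_id='trigger_downstream_dag',
        trigger_dag_id='downstream_dag',  # hypothetical dag_id of the DAG to trigger
        dag=dag,
    )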


1 Answer

You will need to use LocalExecutor.

Check your configs (airflow.cfg); you might be using the SequentialExecutor, which executes tasks serially.

Airflow uses a backend database to store metadata. Check your airflow.cfg file and look for the executor setting. By default, Airflow uses the SequentialExecutor, which executes tasks sequentially no matter what. So to allow Airflow to run tasks in parallel, you will need to create a database in Postgres or MySQL, configure it in airflow.cfg (the sql_alchemy_conn parameter), change your executor to LocalExecutor in airflow.cfg, and then run airflow initdb.

Note that for using LocalExecutor you would need to use Postgres or MySQL instead of SQLite as a backend database.
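
For example, a hedged sketch of the relevant airflow.cfg lines (the connection string is only a placeholder; adjust it to your own Postgres or MySQL database):

    [core]
    # Switch from the default SequentialExecutor so independent tasks can run in parallel
    executor = LocalExecutor
    # Point the metadata database at Postgres (or MySQL) instead of the default SQLite file
    sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

After saving these changes, run airflow initdb so the metadata tables are created in the new database.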

More info: https://airflow.incubator.apache.org/howto/initialize-database.html

From the Airflow documentation: if you want to take a real test drive of Airflow, you should consider setting up a real database backend and switching to the LocalExecutor. As Airflow was built to interact with its metadata using the great SqlAlchemy library, you should be able to use any database backend supported as a SqlAlchemy backend. We recommend using MySQL or Postgres.

kaxil answered Nov 06 '22 03:11