If I have multiple airflow dags with some overlapping python package dependencies, how can I keep each project's dependencies decoupled? Eg. if I had project A and B on the same server I would run each of them with something like...
source /path/to/virtualenv_a/activate
python script_a.py
deactivate
source /path/to/virtualenv_b/activate
python script_b.py
deactivate
Basically, I would like to run dags in the same situation (eg. each dag uses python scripts that may have overlapping package deps. that I would like to develop separately, ie. not have to update all code using a package when I only want to update that package for one project). Note, I've been using the BashOperator to run python tasks like...
do_stuff = BashOperator(
    task_id='my_task',
    bash_command='python /path/to/script.py',
    execution_timeout=timedelta(minutes=30),
    dag=dag)
Is there a way to get this working? Is there some other best-practice way that Airflow intends for people to address (or avoid) these kinds of problems?
By default, Airflow uses the SequentialExecutor, which executes tasks sequentially no matter what. So to allow Airflow to run tasks in parallel you will need to create a database in Postgres or MySQL, configure it in airflow.cfg (the sql_alchemy_conn param), and then change your executor to LocalExecutor in airflow.cfg.
Basic dependencies between Airflow tasks can be set in the following ways: using the bitshift operators (<< and >>), or using the set_upstream and set_downstream methods.
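For example, both styles express the same ordering; this is a minimal sketch with hypothetical task names, assuming Airflow 1.x-style imports:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG('ordering_example', start_date=datetime(2017, 1, 1)) as dag:
    extract = DummyOperator(task_id='extract')
    transform = DummyOperator(task_id='transform')
    load = DummyOperator(task_id='load')

    # bitshift style: extract runs first, then transform, then load
    extract >> transform >> load

    # equivalent method style:
    # extract.set_downstream(transform)
    # transform.set_downstream(load)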
You can do it in one of these ways: add your modules to one of the folders that Airflow automatically adds to PYTHONPATH; add extra folders where you keep your code to PYTHONPATH; or package your code into a Python package and install it together with Airflow.
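As a quick illustration of the first option, a helper module that lives in the dags folder (which Airflow already puts on PYTHONPATH) can be imported straight into a DAG file; the module and function names here (shared_helpers, do_work) are hypothetical:

# dags/my_dag.py
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

import shared_helpers  # hypothetical module at dags/shared_helpers.py

with DAG('uses_shared_code', start_date=datetime(2017, 1, 1)) as dag:
    run_helper = PythonOperator(
        task_id='run_helper',
        python_callable=shared_helpers.do_work)  # hypothetical function in that module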
Based on discussion from the apache-airflow mailing list, the simplest answer that addresses the modular way in which I am using various python scripts for tasks is to directly call virtualenv python interpreter binaries for each script or module, eg.
source /path/to/virtualenv_a/activate
python script_a.py
deactivate
source /path/to/virtualenv_b/activate
python script_b.py
deactivate
would translate to something like
do_stuff_a = BashOperator(
    task_id='my_task_a',
    bash_command='/path/to/virtualenv_a/bin/python /path/to/script_a.py',
    execution_timeout=timedelta(minutes=30),
    dag=dag)

do_stuff_b = BashOperator(
    task_id='my_task_b',
    bash_command='/path/to/virtualenv_b/bin/python /path/to/script_b.py',
    execution_timeout=timedelta(minutes=30),
    dag=dag)
in an airflow dag.
To the question of passing args to the Tasks, it depends on the nature of the args you want to pass in. In my case, there are certain args that depend on what a data table looks like on the day the dag is run (eg. the highest timestamp record in the table, etc.). To add these args to the Tasks, I have a "config dag" that runs before this one. In the config dag, there is a Task that generates the args for the "real" dag as a python dict and converts them to a pickle file. Then the "config" dag has a Task that is a TriggerDagRunOperator that activates the "real" dag, which has initial logic to read from the pickle file generated by the "config" dag (in my case, as a dict), and I read it into the bash_command string like bash_command=f"python script.py {configs['arg1']}".
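A minimal sketch of that pattern, assuming Airflow 1.x-style imports; the dag ids, the pickle path /tmp/real_dag_args.pkl, and the arg1 key are all hypothetical:

# config_dag.py -- builds the args, pickles them, then triggers the "real" dag
import pickle
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator

ARGS_PATH = '/tmp/real_dag_args.pkl'  # hypothetical location for the pickled args

def build_args(**context):
    # in practice this would inspect the data table (eg. highest timestamp record)
    configs = {'arg1': '2017-01-01T00:00:00'}
    with open(ARGS_PATH, 'wb') as f:
        pickle.dump(configs, f)

def always_trigger(context, dag_run_obj):
    # Airflow 1.x TriggerDagRunOperator callback: returning the dag_run_obj triggers the run
    return dag_run_obj

with DAG('config_dag', start_date=datetime(2017, 1, 1), schedule_interval='@daily') as dag:
    generate_args = PythonOperator(task_id='generate_args', python_callable=build_args)
    trigger_real = TriggerDagRunOperator(
        task_id='trigger_real_dag',
        trigger_dag_id='real_dag',
        python_callable=always_trigger)
    generate_args >> trigger_real

# real_dag.py -- reads the pickled args at parse time and bakes them into bash_command
import pickle
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

ARGS_PATH = '/tmp/real_dag_args.pkl'

try:
    with open(ARGS_PATH, 'rb') as f:
        configs = pickle.load(f)
except IOError:
    configs = {'arg1': ''}  # file may not exist yet on the very first parse

with DAG('real_dag', start_date=datetime(2017, 1, 1), schedule_interval=None) as dag:
    do_stuff_a = BashOperator(
        task_id='my_task_a',
        bash_command=f"/path/to/virtualenv_a/bin/python /path/to/script_a.py {configs['arg1']}",
        execution_timeout=timedelta(minutes=30))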