On my local machine I created a virtualenv and installed Airflow. When a DAG or plugin requires a Python library, I pip-install it into the same virtualenv.
How can I keep track of which libraries belong to a DAG, and which are used by Airflow itself? I recently deleted a DAG and wanted to remove the libraries it was using. It was pretty time-consuming, and I was crossing my fingers that I wasn't deleting something another DAG still needed!
Particularly for larger Airflow use-cases, I'd recommend using Airflow as a way to orchestrate tasks on a different layer of abstraction so you aren't managing dependencies from the Airflow side.
I'd recommend taking a look at either the DockerOperator or KubernetesPodOperator. With these, you can build your Python tasks into Docker containers, and have Airflow run those. That way you don't need to manage Python dependencies in Airflow, and you won't encounter any disaster scenarios where two DAGs have conflicting dependencies. This does, however, require you to be knowledgeable about managing a Kubernetes cluster.
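To illustrate, here is a minimal sketch of a DAG that delegates the work to a container, assuming Airflow 2.x with the Docker provider installed; the image name my-org/etl-task:1.0 and the script path are hypothetical examples, not a real setup.

    # All Python dependencies for this task live inside the Docker image,
    # so the Airflow virtualenv only needs Airflow and the Docker provider.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.docker.operators.docker import DockerOperator

    with DAG(
        dag_id="dockerized_etl",
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        run_etl = DockerOperator(
            task_id="run_etl",
            image="my-org/etl-task:1.0",   # hypothetical image with the task's deps baked in
            command="python /app/etl.py",  # hypothetical entrypoint inside the image
            auto_remove=True,              # clean up the container when it finishes
        )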
There is airflow.operators.python_operator.PythonVirtualenvOperator, which you might consider using in DAGs where you currently use a PythonOperator. Using PythonVirtualenvOperator in place of PythonOperator isolates a DAG's dependencies in a virtualenv, and you can keep separate requirements files.
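As a rough sketch of what that looks like (the dependency pin pandas==1.5.3 and the task names are illustrative; on Airflow 2.x the same operator is importable from airflow.operators.python):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonVirtualenvOperator


    def summarize():
        # Imports must happen inside the callable: it runs in the freshly
        # created virtualenv, not in the Airflow worker's own environment.
        import pandas as pd

        df = pd.DataFrame({"value": [1, 2, 3]})
        print(df["value"].sum())


    with DAG(
        dag_id="venv_isolated_dag",
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        summarize_task = PythonVirtualenvOperator(
            task_id="summarize",
            python_callable=summarize,
            requirements=["pandas==1.5.3"],  # installed into a throwaway virtualenv
            system_site_packages=False,      # keep the venv isolated from Airflow's deps
        )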
You may use comments in the requirements file to mark which dependencies belong to which DAG, e.g.
package-one  # Dag1
Then, when you delete the DAG, grep the requirements file for the DAG's name, uninstall those packages, and delete the lines.
With this approach, whenever you install a package for a DAG you need a process for adding the DAG's name as a comment in your requirements file. You could write a small script to perform this; a sketch follows below.
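For example, a small helper like the one below could list the packages tagged with a given DAG name so you can feed them to pip uninstall and remove the lines. The file name requirements.txt and the "# Dag1" comment convention are assumptions matching the example above.

    import sys


    def packages_for_dag(requirements_path: str, dag_name: str) -> list:
        """Return package specifiers whose trailing comment mentions dag_name."""
        packages = []
        with open(requirements_path) as f:
            for line in f:
                spec, _, comment = line.partition("#")
                if dag_name.lower() in comment.lower() and spec.strip():
                    packages.append(spec.strip())
        return packages


    if __name__ == "__main__":
        # Usage: python find_dag_deps.py requirements.txt Dag1
        for pkg in packages_for_dag(sys.argv[1], sys.argv[2]):
            print(pkg)  # feed these to "pip uninstall", then delete the lines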