I am trying to help my team of data scientists run their code using Airflow. The problem I face is that their Python scripts read/write some intermediate files.
1) Is there any way to set a working directory where their scripts and files can live, so that they do not clutter the dags folder?
2) Even if I use the dags folder, I would have to specify the absolute path every time I read/write those files, unless there is some other way around this?
i.e. I would have to do this all the time:
import os
absolute_path = "/some/long/directory/path"
f = os.path.join(absolute_path, file_name)
What I do is keep a separate folder with all the modules that need to be run, and I add that folder to the Airflow run environment:
import sys

PATH_MODULES = "/home/airflow-worker-1/airflow_modules/"
sys.path += [PATH_MODULES]
This way, I can import any functions from those folders (provided that they have an __init__.py), because they are treated as packages.
airflow_modules
  |_ code_repository_1
  |_ code_repository_2
  |_ code_repository_3
       |_ file_1.py
       |_ config.py
So in your DAG code you use:
from code_repository_1.data_cleaning import clean_1
from code_repository_2.bigquery_operations import operation_1
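To show how the pieces fit together, here is a rough sketch of a DAG file that adds the modules folder to sys.path and wires those imported functions into tasks. The DAG id, schedule, and task ids are illustrative assumptions, and it assumes Airflow 2.x import paths; clean_1 and operation_1 are the functions imported above.

import sys
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Make the shared modules folder importable (path from above).
PATH_MODULES = "/home/airflow-worker-1/airflow_modules/"
sys.path += [PATH_MODULES]

from code_repository_1.data_cleaning import clean_1
from code_repository_2.bigquery_operations import operation_1

with DAG(
    dag_id="data_science_pipeline",   # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    clean_task = PythonOperator(
        task_id="clean_data",
        python_callable=clean_1,       # runs code from the shared modules folder
    )
    load_task = PythonOperator(
        task_id="load_to_bigquery",
        python_callable=operation_1,
    )

    clean_task >> load_task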
One thing to keep in mind is that since this treats the repositories as packages, if you need file_1.py to import a variable from config.py, then you have to use a relative import: from .config import variable_1.
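For example, assuming file_1.py and config.py both live inside code_repository_3 as in the tree above, and that config.py defines variable_1, file_1.py would look something like:

# code_repository_3/file_1.py
from .config import variable_1  # relative import within the same package


def file_1_function():
    # Illustrative use of the shared config value.
    return f"using {variable_1}"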
You can use the os module to do this. If you put something like this section of code at the top of your DAG file:
import os
os.chdir('/home/lnx/test/')
it will change the working directory for all tasks running in the DAG to /home/lnx/test, so you wouldn't have to provide absolute paths. It will, however, need to be included at the top of every DAG file that requires this working directory.
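As a rough sketch of what that looks like in a full DAG file (the directory, file name, task id, and Airflow 2.x import paths are illustrative assumptions):

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Change the working directory for everything parsed and run from this DAG file.
os.chdir('/home/lnx/test/')


def write_intermediate():
    # The relative path now resolves under /home/lnx/test/
    with open('intermediate.csv', 'w') as f:
        f.write('a,b\n1,2\n')


with DAG(
    dag_id="chdir_example",          # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="write_file", python_callable=write_intermediate)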
Although this is a late answer, hopefully it can help someone else in this position.