How to work with configuration files in Airflow

In Airflow, we've created several DAGs, some of which share common properties, for example the directory to read files from. Currently, these properties are duplicated in each separate DAG, which will obviously become problematic in the future. If the directory name were to change, we'd have to go into each DAG and update that piece of code (possibly even missing one).

I was looking into creating some sort of configuration file that can be parsed by Airflow and used by the various DAGs when a certain property is required, but I cannot seem to find any documentation or guide on how to do this. The most I could find was the documentation on setting up Connection IDs, but that does not meet my use case.

So, my question: is it possible to do the above scenario, and if so, how?

Thanks in advance.

Tsume asked Aug 24 '18

People also ask

How do I access Airflow cfg files?

The first time you run Airflow, it will create a file called airflow.cfg in your $AIRFLOW_HOME directory (~/airflow by default). This file contains Airflow's configuration, and you can edit it to change any of the settings.
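If you need those settings from code rather than by editing the file, Airflow also exposes them through its configuration module. A minimal sketch, assuming standard section/key names from airflow.cfg:

from airflow.configuration import conf

# Read values straight out of airflow.cfg (environment overrides included)
dags_folder = conf.get("core", "dags_folder")
executor = conf.get("core", "executor")
print(dags_folder, executor)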

How do I run a DAG file in Airflow?

To create a DAG in Airflow, you always have to import the DAG class. After the DAG class come the imports of Operators. Basically, for each Operator you want to use, you have to make the corresponding import. For example, if you want to execute a Python function, you have to import the PythonOperator.
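A minimal sketch of such a DAG (the dag_id, schedule, and task body are placeholders; the import paths are the Airflow 1.x ones, matching the era of this question):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def greet():
    # Placeholder task body
    print("hello from Airflow")

dag = DAG(
    dag_id="example_dag",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

# One PythonOperator task that runs greet() on every DAG run
greet_task = PythonOperator(task_id="greet", python_callable=greet, dag=dag)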

How do I set an environment variable in Airflow?

As per this answer, the variables should be put in /etc/default/airflow (on Debian/Ubuntu) or /etc/sysconfig/airflow (on CentOS/RedHat). If you are just running a local instance, you should be able to use environment variables like you expect.
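Airflow maps environment variables of the form AIRFLOW__{SECTION}__{KEY} (note the double underscores) onto airflow.cfg settings, and they take precedence over the file. A sketch with a placeholder path; this must be in the environment before the Airflow process reads its config:

import os

# Overrides [core] dags_folder in airflow.cfg; equivalent to exporting the
# variable in the shell before starting the scheduler/webserver
os.environ["AIRFLOW__CORE__DAGS_FOLDER"] = "/opt/airflow/dags"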

How do you use variables in Airflow?

Variables in Airflow are a generic way to store and retrieve arbitrary content or settings as a simple key-value store within Airflow. Variables can be listed, created, updated, and deleted from the UI (Admin -> Variables), code, or CLI. In addition, JSON settings files can be bulk uploaded through the UI.
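A sketch of using Variables from DAG code (the key names here are placeholders):

from airflow.models import Variable

# Write a value (also possible via Admin -> Variables in the UI, or the CLI)
Variable.set("base_directory", "/data/incoming")

# Read it back; default_var avoids an error if the key doesn't exist yet
base_dir = Variable.get("base_directory", default_var="/tmp")

# JSON values can be deserialized automatically
sync_config = Variable.get("sync_config", deserialize_json=True, default_var=[])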


1 Answer

There are a few ways you can accomplish this based on your setup:

  • You can use a DagFactory-type approach, where a function generates DAGs. You can find an example of what that looks like here; a minimal sketch of the idea also appears after the code below.

  • You can store a JSON config as an Airflow Variable and parse it to generate a DAG. You can store something like this under Admin -> Variables:

[{ "table": "users", "schema":"app_one", "s3_bucket":"etl_bucket", "s3_key":"app_one_users", "redshift_conn_id":"postgres_default" }, { "table": "users", "schema":"app_two", "s3_bucket":"etl_bucket", "s3_key":"app_two_users", "redshift_conn_id":"postgres_default"}]

Your DAG could get generated as:

import json
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.redshift_to_s3_operator import RedshiftToS3Transfer

# Parse the JSON config stored under Admin -> Variables as "sync_config"
sync_config = json.loads(Variable.get("sync_config"))

# dag_id and schedule here are placeholders
dag = DAG('sync_tables', start_date=datetime(2018, 1, 1), schedule_interval='@daily')

with dag:
    start = DummyOperator(task_id='begin_dag')
    # One transfer task per table entry in the config
    for table in sync_config:
        d1 = RedshiftToS3Transfer(
            task_id='{0}'.format(table['s3_key']),
            table=table['table'],
            schema=table['schema'],
            s3_bucket=table['s3_bucket'],
            s3_key=table['s3_key'],
            redshift_conn_id=table['redshift_conn_id']
        )
        start >> d1
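The DagFactory approach from the first bullet follows the same pattern: a plain function that builds and returns a DAG from shared parameters. A minimal sketch, where create_dag and its parameters are hypothetical names:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

def create_dag(dag_id, schedule, base_directory):
    # Build one DAG from shared parameters; real tasks would use base_directory
    dag = DAG(dag_id, start_date=datetime(2018, 1, 1), schedule_interval=schedule)
    with dag:
        DummyOperator(task_id='placeholder')
    return dag

# Airflow discovers DAGs through module-level globals, hence globals()[...]
for app in ['app_one', 'app_two']:
    dag_id = 'sync_{0}'.format(app)
    globals()[dag_id] = create_dag(dag_id, '@daily', '/data/incoming')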

Similarly, you can just store that config as a local file and open it as you would any other file. Keep in mind that the best approach will depend on your infrastructure and use case.
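For the local-file variant, a minimal sketch, assuming a config.json sitting next to the DAG file (the filename and layout are assumptions):

import json
import os

# Resolve the config relative to this DAG file so it works no matter
# where the scheduler process runs from
config_path = os.path.join(os.path.dirname(__file__), 'config.json')

with open(config_path) as f:
    sync_config = json.load(f)

# sync_config can now drive task generation exactly as in the
# Variable-based example above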

Viraj Parekh answered Oct 11 '22