Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python script scheduling in airflow

Hi everyone,

I need to schedule my python files(which contains data extraction from sql and some joins) using airflow. I have successfully installed airflow into my linux server and webserver of airflow is available with me. But even after going through documentation I am not clear where exactly I need to write script for scheduling and how will that script be available into airflow webserver so I could see the status

As far as the configuration is concerned I know where the dag folder is located in my home directory and also where example dags are located.

Note: Please dont mark this as duplicate with How to run bash script file in Airflow as I need to run python files lying in some different location.

Please find the configuration in Airflow webserver as :

enter image description here

Below is the screenshot of dag folder in AIRFLOW_HOME dir

enter image description here

Also find the below screenshot for DAG creation screenshot and Missing DAG error

enter image description here

enter image description here

After i select the simple DAG following error of missing DAG is populated

enter image description here

like image 470
Abhishek Pansotra Avatar asked Jan 18 '17 22:01

Abhishek Pansotra


People also ask

How do I schedule a Python script using Airflow?

Now to schedule Python scripts with Apache Airflow, open up the dags folder where your Airflow is installed or create a folder called “dags” in there. This will be the place where all your dags, or, python scripts will be. Once you have it, create a file in there ending with a .

How do I run Python code in Airflow DAG?

In order to create a Python DAG in Airflow, you must always import the required Python DAG class. Following the DAG class are the Operator imports. Basically, you must import the corresponding Operator for each one you want to use. To execute a Python function, for example, you must import the PythonOperator.

How do I run a scheduler in Airflow?

The Airflow scheduler is designed to run as a persistent service in an Airflow production environment. To kick it off, all you need to do is execute the airflow scheduler command. It uses the configuration specified in airflow. cfg .

Can Airflow run Python?

Apache Airflow is an open-source workflow management platform for building Data Pipelines. It enables users to schedule and run Data Pipelines using the flexible Python Operators and framework.


3 Answers

You should probably use the PythonOperator to call your function. If you want to define the function somewhere else, you can simply import it from a module as long as it's accessible in your PYTHONPATH.

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

from my_script import my_python_function

dag = DAG('tutorial', default_args=default_args)

PythonOperator(dag=dag,
               task_id='my_task_powered_by_python',
               provide_context=False,
               python_callable=my_python_function,
               op_args=['arguments_passed_to_callable'],
               op_kwargs={'keyword_argument':'which will be passed to function'})

If your function my_python_function was in a script file /path/to/my/scripts/dir/my_script.py

Then before starting Airflow, you could add the path to your scripts to the PYTHONPATH like so:

export PYTHONPATH=/path/to/my/scripts/dir/:$PYTHONPATH

More information here: https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html

Default args and other considerations as in the tutorial: https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html

like image 183
postrational Avatar answered Oct 14 '22 02:10

postrational


You can also use bashoperator to execute python scripts in Airflow. You can put your scripts in a folder in DAG folder. If your scripts are somewhere else, just give a path to those scripts.

    from airflow import DAG
    from airflow.operators import BashOperator,PythonOperator
    from datetime import datetime, timedelta

    seven_days_ago = datetime.combine(datetime.today() - timedelta(7),
                                      datetime.min.time())

    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': seven_days_ago,
        'email': ['[email protected]'],
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
      }

    dag = DAG('simple', default_args=default_args)
t1 = BashOperator(
    task_id='testairflow',
    bash_command='python /home/airflow/airflow/dags/scripts/file1.py',
    dag=dag)
like image 34
liferacer Avatar answered Oct 14 '22 01:10

liferacer


Airflow parses all Python files in $AIRFLOW_HOME/dags (in your case /home/amit/airflow/dags). And that python script should retrun a DAG object back as shown in answer from "postrational". Now when it is being reported as missing that means there is some issue in Python code and Airflow could not load it. Check airflow webserver or scheduler logs for more details, as stderr or stdout goes there.

like image 1
user7126545 Avatar answered Oct 14 '22 01:10

user7126545