 

Problem with start date and scheduled date in Apache Airflow

I am working with Apache Airflow and I have a problem with the scheduled day and the starting day.

I want a DAG to run every day at 8:00 AM UTC. So, I did:

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 12, 7, 10, 0, 0),
    'email': ['[email protected]'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(hours=5),
}

# This DAG never ran
dag = DAG(dag_id='id', default_args=default_args, schedule_interval='0 8 * * *', catchup=True)

I uploaded the DAG on 2020-12-07 and wanted its first run to happen on 2020-12-08 at 08:00:00.

I set the start_date to 2020-12-07 at 10:00:00 so that it would skip 2020-12-07 at 08:00:00 and only trigger the next day, but it didn't work.

Then I modified the starting day:

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 12, 7, 7, 59, 0),
    'email': ['[email protected]'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(hours=5),
}

# Never run
dag = DAG(dag_id='etl-ca-cpke-spark_dev_databricks', default_args=default_args, schedule_interval='0 8 * * *', catchup=True)

Now the start date is 1 minute before the time the DAG should run. Because catchup is set to True, the DAG was indeed triggered for 2020-12-07 at 08:00:00, but it has not been triggered for 2020-12-08 at 08:00:00.

Why?

J.C Guzman asked Dec 08 '20 09:12







1 Answer

Airflow schedules tasks at the end of the interval (See documentation reference)

Meaning that when you do:

start_date: datetime(2020, 12, 7, 8, 0,0)
schedule_interval: '0 8 * * *'

The first run will kick in on 2020-12-08 at approximately 08:00 (the exact moment depends on available resources).

This run's execution_date will be 2020-12-07 08:00.

The next run will kick in on 2020-12-09 at 08:00.

This run's execution_date will be 2020-12-08 08:00.

Since today is 2020-12-08, the run with execution_date 2020-12-08 08:00 has not kicked in yet: its interval does not end until 2020-12-09 at 08:00.
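
The end-of-interval rule can be checked with plain Python for both start dates from the question. This is only a sketch of the rule described above, not Airflow's scheduler code: daily_runs is a hypothetical helper, the schedule is assumed to be once a day at a fixed hour, and "now" is assumed to be the time the question was asked (2020-12-08 09:12, taken as UTC).

```python
from datetime import datetime, timedelta


def daily_runs(start_date, hour, now):
    """Hypothetical helper: list (execution_date, trigger_time) pairs
    for a daily schedule at `hour`:00 that are already due at `now`."""
    interval = timedelta(days=1)
    # First schedule point at or after start_date.
    execution_date = start_date.replace(hour=hour, minute=0, second=0, microsecond=0)
    if execution_date < start_date:
        execution_date += interval
    runs = []
    # A run is due only once its whole interval has passed.
    while execution_date + interval <= now:
        runs.append((execution_date, execution_date + interval))
        execution_date += interval
    return runs


now = datetime(2020, 12, 8, 9, 12)  # assumed: when the question was asked

# start_date 2020-12-07 10:00: first execution_date is 2020-12-08 08:00,
# which is not due until 2020-12-09 08:00, so nothing has run yet.
runs_a = daily_runs(datetime(2020, 12, 7, 10, 0), 8, now)

# start_date 2020-12-07 07:59: execution_date 2020-12-07 08:00
# became due at 2020-12-08 08:00, so exactly one run exists.
runs_b = daily_runs(datetime(2020, 12, 7, 7, 59), 8, now)

for execution_date, trigger_time in runs_b:
    print("execution_date", execution_date, "triggered at", trigger_time)
```

Under these assumptions the first configuration has no due run at all, which matches the "never run" observation, and the second has a single run whose execution_date is 2020-12-07 08:00 but which was actually triggered on 2020-12-08.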

Elad Kalif answered Sep 29 '22 09:09
