 

Airflow task with null status

I'm having an issue with Airflow when running it on a 24xlarge EC2 machine.

I must note that the parallelism level is 256.

For some days now, the DAG run has been finishing with status 'failed' for two undetermined reasons:

  1. Some tasks have the status 'upstream_failed', which is not true, because we can clearly see that all the previous steps were successful.

  2. Other tasks have the status 'null'; they have not started yet, and they cause the DAG run to fail.

I must note that the logs for both of these tasks are empty.


And here are the task instance details for these cases:


Any solutions, please?

asked Nov 15 '18 by I.Chorfi


People also ask

What is max_active_runs in Airflow?

max_active_runs: This is the maximum number of active DAG runs allowed for the DAG in question. Once this limit is hit, the scheduler will not create new active DAG runs.
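As a rough sketch, the cap is set in the DAG constructor; the DAG id and schedule below are placeholders, not taken from the question:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="example_dag",              # hypothetical DAG id
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
    max_active_runs=1,                 # scheduler keeps at most one active run
)

start = DummyOperator(task_id="start", dag=dag)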

How do I run a specific task in Airflow?

You can run a task independently by using the -i/-I/-A flags along with the run command. By design, however, Airflow does not permit running a specific task together with all its dependencies.
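For illustration, assuming an Airflow 1.10-era CLI, with placeholder DAG/task ids and execution date:

airflow run my_dag my_task 2018-11-15 -i    # ignore task-specific dependencies (upstream, depends_on_past, retry delay)
airflow run my_dag my_task 2018-11-15 -I    # ignore depends_on_past only
airflow run my_dag my_task 2018-11-15 -A    # ignore all non-critical dependencies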

How do I create a dynamic task in Airflow?

Airflow's dynamic task mapping feature is built on the MapReduce programming model. The map procedure takes a set of inputs and creates a single task for each one. The reduce procedure, which is optional, allows a task to operate on the collected output of a mapped task.
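A minimal sketch using the TaskFlow API (dynamic task mapping requires Airflow 2.3+; the DAG and function names here are illustrative):

from datetime import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2022, 1, 1), schedule_interval=None, catchup=False)
def mapped_example():

    @task
    def add_one(x):
        # "map": one task instance is created per element of the input list
        return x + 1

    @task
    def total(values):
        # "reduce": operates on the collected output of the mapped task
        return sum(values)

    total(add_one.expand(x=[1, 2, 3]))

mapped_example()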

How do I know if a DAG is running in Airflow?

To test this, you can run airflow dags list and confirm that your DAG shows up in the list. You can also run airflow tasks list foo_dag_id --tree and confirm that your task shows up in the list as expected.




1 Answer

I'm hoping you already got an answer / were able to move on. I've been stuck on this issue a few times in the last month, so I figured I would document what I ended up doing to solve it.

The case where I've experienced the second condition ("Other tasks have the status 'null'") is when the task instance has changed, and specifically changed operator type.


Example:

  • The task instance is originally an instance of a SubDagOperator
  • Requirements cause the operator type to change from a SubDagOperator to a PythonOperator
  • After the change, the PythonOperator's task instance is set to state NULL (a sketch of this change follows below)
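A hypothetical before/after of that change (the DAG id, task id, and callable are placeholders). The point is that the task keeps its task_id, so the existing task_instance rows still record the old operator type:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(dag_id="my_pipeline", start_date=datetime(2018, 11, 1),
          schedule_interval="@daily")

def do_work():
    pass

# "process" used to be a SubDagOperator; after the refactor it is a
# PythonOperator, but the task_instance table still holds rows with
# operator = 'SubDagOperator' for this same task_id.
process = PythonOperator(task_id="process", python_callable=do_work, dag=dag)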

As best I can piece together, what's happening is:

  • Airflow is introspecting the operator associated with each task
  • Each task instance is logged into the database table task_instance
    • This table has an attribute called operator
  • When the scheduler re-introspects the code, it looks for a task_instance with the correct operator type; not seeing one, it updates the associated database record(s) to state = 'removed'
  • When the DAG is subsequently scheduled, Airflow creates a task instance for the new operator, but it comes up with state 'null' and never runs, which fails the DAG run

You can see tasks impacted by this process with the query:

SELECT *
FROM task_instance
WHERE state = 'removed'
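A slightly expanded variant (assuming the stock metadata schema, which, as noted above, stores the operator name on each row) also shows which operator type each removed row recorded:

SELECT dag_id, task_id, execution_date, operator, state
FROM task_instance
WHERE state = 'removed';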

It looks like there's been work on this issue for Airflow 1.10:

  • https://github.com/apache/airflow/pull/3137/commits/db29af4ffb3d120ad55cd089a44b99feb7b8bf38

That said, based on the commits I can find, I'm not 100% sure it would resolve this issue. The general philosophy still seems to be "when a DAG changes, you should increment / change the DAG name".

I don't love that solution, because it makes it hard to iterate on what is fundamentally one pipeline. The alternative I used was to (partially) follow the recommendations from Astronomer and "blow out" the DAG history. To do that, you need to:

  • Stop the scheduler
  • Delete the history for the DAG (see the sketch after this list)
    • This should result in the DAG completely disappearing from the web UI
    • If it doesn't completely disappear, a scheduler is still running somewhere
  • Restart the scheduler
    • Note: if you're running the DAG on a schedule, be prepared for it to backfill / catch up / run its latest schedule, because you've removed the history
    • If you don't want it to do this, Astronomer's "Fast Forward a DAG" suggestions can be applied
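For reference, on an Airflow 1.10-era install, one way to do the deletion step is the delete_dag CLI command (the DAG id is a placeholder); this is a sketch, assuming that command is available in your version:

airflow delete_dag my_pipeline    # removes the DAG's metadata records (task instances, DAG runs, etc.) from the database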
answered Nov 07 '22 by Adam Bethke