
airflow cleared tasks not getting executed

Tags:

python

airflow

Preamble

Yet another airflow tasks not getting executed question...

Everything was going more or less fine in my airflow experience up until this weekend when things really went downhill.

I have checked all the standard things e.g. as outlined in this helpful post.

I have reset the whole instance multiple times trying to get it working properly but I am totally losing the battle here.

Environment

  • version: airflow 1.10.2
  • os: centos 7
  • python: python 3.6
  • virtualenv: yes
  • executor: LocalExecutor
  • backend db: mysql

The problem

Here's what happens in my troubleshooting infinite loop / recurring nightmare.

  1. I reset the metadata DB (or possibly the whole virtualenv and config etc) and re-enter connection information.
  2. Tasks will get executed once. They may succeed. If I missed something in the setup, a task may fail.
  3. When a task fails, it goes to the retry state.
  4. I fix the issue (e.g. I forgot to enter a connection) and manually clear the task instance (a programmatic equivalent is sketched just after this list).
  5. Cleared task instances do not run, but just sit in a "none" state.
  6. Attempts to get the dag running again fail.
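For step 4, clearing can also be done outside the UI. A minimal sketch, assuming Airflow 1.10.x, a placeholder dag_id of "my_dag", and that it runs in the same virtualenv / AIRFLOW_HOME as the scheduler (this is just an equivalent way to clear, not necessarily how I was doing it):

```python
# Sketch (Airflow 1.10.x): clear failed task instances programmatically instead of via the UI.
# "my_dag" is a placeholder dag_id.
from airflow.models import DagBag

dag = DagBag().get_dag("my_dag")

# dry_run=True returns the task instances that would be cleared, without changing them
for ti in dag.clear(only_failed=True, dry_run=True):
    print(ti.task_id, ti.execution_date, ti.state)

# once the list looks right, actually clear them so the scheduler can re-run them
dag.clear(only_failed=True)
```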

Before I started having this trouble, after I cleared a task instance, it would always very quickly get picked up and executed again.

But now, clearing the task instance usually results in the task instance getting stuck in a cleared state. It just sits there.

Worse, if I try failing the dag and all its instances, and then manually triggering the dag again, the task instances get created but stay in the 'none' state. Restarting the scheduler doesn't help.
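To see what the scheduler sees for a run that appears stuck, a quick look at the metadata helps. A minimal sketch, again assuming Airflow 1.10.x and a placeholder dag_id of "my_dag":

```python
# Sketch (Airflow 1.10.x): print the state of the latest dag run and of its task instances.
from airflow.models import DagRun

runs = DagRun.find(dag_id="my_dag")
if runs:
    latest = sorted(runs, key=lambda r: r.execution_date)[-1]
    print("dag run", latest.execution_date, "state:", latest.state)
    for ti in latest.get_task_instances():
        print("  ", ti.task_id, ti.state)
```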

Other observation

This is probably a red herring, but one thing I have noticed only recently is that when I click on the icon representing the task instances stuck in the 'none' state, it takes me to a "task instances" view with the wrong filter applied: the filter is set to "string equals null".

But you need to switch it to "string empty yes" in order to have it actually return the task instances that are stuck.

I am assuming this is just an unrelated UI bug, a red herring as far as I am concerned, but I thought I'd mention it just in case.
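One way to sidestep the filter quirk entirely is to query the metadata DB directly; the stuck rows appear to have their state stored as SQL NULL. A small sketch, assuming Airflow 1.10.x and a placeholder dag_id of "my_dag":

```python
# Sketch (Airflow 1.10.x): list task instances whose state is NULL in the metadata DB.
from airflow import settings
from airflow.models import TaskInstance

session = settings.Session()
stuck = (
    session.query(TaskInstance)
    .filter(TaskInstance.dag_id == "my_dag", TaskInstance.state.is_(None))
    .all()
)
for ti in stuck:
    print(ti.task_id, ti.execution_date, ti.state)
session.close()
```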

Edit 1

I am noticing that there is some "null operator" business going on: why is my operator null? I will look into it.

Edit 2

Is null a valid value for task instance state? Or is this an indicator that something is wrong?

Is it legit to have a null task instance state?
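For what it's worth, a look at airflow.utils.state suggests that "no status" is represented by None, so a null state is at least a recognized value rather than corruption by itself. A tiny check, assuming 1.10.x:

```python
# State.NONE is literally None and (on 1.10.x) is listed among the recognized task states.
from airflow.utils.state import State

print(State.NONE)                 # None
print(None in State.task_states)  # True on 1.10.x
```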

Edit 3

More none stuff.

Here are some bits from the task instance details page. Lots of attributes are none:

Task Instance Details
Dependencies Blocking Task From Getting Scheduled
Dependency  Reason
Unknown All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
- The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
- This task instance already ran and had its state changed manually (e.g. cleared in the UI)

If this task instance does not start soon please contact your Airflow administrator for assistance.
Task Instance Attributes
Attribute   Value
duration    None
end_date    None
is_premature    False
job_id  None
operator    None
pid None
queued_dttm None
raw False
run_as_user None
start_date  None
state   None
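Since that dependency message points at scheduler load and the concurrency limits, here is a small sketch for printing those settings as the scheduler resolves them (Airflow 1.10.x [core] option names; as far as I can tell, the option the message calls max_active_dag_runs_per_dag is the max_active_runs_per_dag setting):

```python
# Sketch (Airflow 1.10.x): dump the [core] settings that can limit how many task
# instances get queued, as resolved from airflow.cfg plus any env-var overrides.
from airflow.configuration import conf

for option in ("parallelism", "dag_concurrency",
               "max_active_runs_per_dag", "non_pooled_task_slot_count"):
    print(option, "=", conf.getint("core", option))
```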

Update

I may finally be on to something...

After my nightmarish, marathon, stuck-in-twilight-zone troubleshooting session, I threw my hands up and resolved to use docker containers instead of running natively. It was just too weird. Things were just not making sense. I needed to move to docker so that the environment could be completely controlled and reproduced.

So I started working on the docker setup based on puckel/docker-airflow. This was no trivial task either, because I decided to use environment variables for all parameters and connections. Not all hooks parse connection URIs the same way, so you have to be careful and look at the code and do some trial and error.
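For reference, the convention I mean is AIRFLOW_CONN_<CONN_ID> for connections (as a URI) and AIRFLOW__<SECTION>__<KEY> for config overrides. A minimal sketch with a made-up connection id and credentials, showing how a hook-side lookup resolves the URI; this is the parsing you have to verify per hook:

```python
# Sketch (Airflow 1.10.x): connections and config supplied purely via environment
# variables. "my_mysql" and the credentials below are placeholders, not my real setup.
import os

os.environ["AIRFLOW_CONN_MY_MYSQL"] = "mysql://user:pass@dbhost:3306/mydb"  # connection as URI
os.environ["AIRFLOW__CORE__PARALLELISM"] = "32"                             # config override

from airflow.hooks.base_hook import BaseHook

# get_connection checks AIRFLOW_CONN_* before the metadata DB, so this shows
# exactly how the URI is split up into host/port/schema/login.
conn = BaseHook.get_connection("my_mysql")
print(conn.host, conn.port, conn.schema, conn.login)
```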

So I did that and finally got my docker setup working locally. But when I went to build the image on my EC2 instance, I found that the disk was full, and it was full in no small part because of airflow logs.

So, my new theory is that lack of disk space may have had something to do with this. I am not sure if I will be able to find a smoking gun in the logs, but I will look.
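Checking for that condition takes only a few lines. A sketch, assuming the default log location of ~/airflow/logs (adjust to your base_log_folder):

```python
# Sketch: report free space on the root filesystem and the size of the airflow log directory.
import os
import shutil

total, used, free = shutil.disk_usage("/")
print("free on /: %.1f GB" % (free / 1e9))

log_dir = os.path.expanduser("~/airflow/logs")
log_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _dirs, files in os.walk(log_dir)
    for name in files
)
print("airflow logs: %.1f GB" % (log_bytes / 1e9))
```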

asked Mar 11 '19 by dstandish



1 Answer

OK, I am closing this out and marking the presumptive root cause as the server being out of space.

There were a number of contributing factors:

  1. My server did not have a lot of storage, only 10GB. I did not realize it was so low. Resolution: add more space.
  2. Logging in airflow 1.10.2 went a little crazy. An INFO log message was outputting Harvesting DAG parsing results every second or two, which eventually resulted in a large log file. Resolution: this is fixed in commit [AIRFLOW-3911] Change Harvesting DAG parsing results to DEBUG log level (#4729), which is in 1.10.3, but you can always fork and cherry pick if you are stuck on 1.10.2 (a possible stopgap is also sketched after this list).
  3. Additionally, some of my scheduler / webserver interval params could have benefited from an increase. As a result I ended up with multi-GB log files. I think this may have been partly due to changing airflow versions without correctly updating airflow.cfg. Solution: when upgrading (or changing versions), temporarily move airflow.cfg aside so that a cfg compatible with the new version gets generated, then merge the two carefully. Another strategy is to rely only on environment variables, so that your config is always that of a fresh install, and the only parameters in your env variables are overrides and, possibly, connections.
  4. Airflow may not log errors anywhere in this case; everything looked fine, except the scheduler was not queuing up jobs, or it would queue one or two and then just stop, without any error message. Solutions can include (1) adding out-of-space alarms with your cloud provider, and (2) figuring out how to make the scheduler raise a helpful exception in this case and contributing that to airflow.
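Regarding item 2, if you are stuck on 1.10.2 and do not want to cherry pick, a possible stopgap (an assumption on my part, not something I verified at the time) is to raise the logging level, e.g. via the standard env-var override, equivalent to setting logging_level = WARNING under [core] in airflow.cfg. Note this suppresses other INFO output too:

```python
# Sketch: raise Airflow's logging level via the env-var config override (1.10.x),
# so the per-second INFO "Harvesting DAG parsing results" lines stop filling the logs.
# Set this in the scheduler's environment before it starts.
import os

os.environ["AIRFLOW__CORE__LOGGING_LEVEL"] = "WARNING"

from airflow.configuration import conf
print(conf.get("core", "logging_level"))  # the env var takes precedence over airflow.cfg
```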
answered Oct 19 '22 by dstandish