Preamble
Yet another "Airflow tasks not getting executed" question...
Everything was going more or less fine in my airflow experience up until this weekend when things really went downhill.
I have checked all the standard things, e.g. those outlined in this helpful post.
I have reset the whole instance multiple times trying to get it working properly but I am totally losing the battle here.
Environment
The problem
Here's what happens in my troubleshooting infinite loop / recurring nightmare.
Before I started having this trouble, after I cleared a task instance, it would always get picked up and executed again very quickly.
But now, clearing the task instance usually results in the task instance getting stuck in a cleared state. It just sits there.
Worse, if I try failing the DAG and all its task instances, and then manually triggering the DAG again, the task instances get created but stay in the 'none' state. Restarting the scheduler doesn't help.
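For what it's worth, one way to see exactly which task instances are sitting in that state is to query the metadata DB through Airflow's own ORM. This is just a sketch, assuming Airflow 1.10.x; "my_dag" is a placeholder dag_id:

    from airflow.models import TaskInstance
    from airflow.utils.db import provide_session

    @provide_session
    def find_stuck_tasks(dag_id, session=None):
        # Task instances that exist but have no state ('none' in the UI)
        tis = (
            session.query(TaskInstance)
            .filter(TaskInstance.dag_id == dag_id, TaskInstance.state.is_(None))
            .all()
        )
        for ti in tis:
            print(ti.task_id, ti.execution_date, ti.state)
        return tis

    find_stuck_tasks("my_dag")  # "my_dag" is a placeholder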
Other observation
This is probably a red herring, but one thing I have noticed only recently is that when I click on the icon representing the task instances stuck in the 'none' state, it takes me to a "Task Instances" list view with the wrong filter applied: the filter is set to "string equals null".
You need to switch it to "string empty yes" to get it to actually return the stuck task instances (presumably because an equality comparison against null never matches, while an "is empty" check does).
I am assuming this is just an unrelated UI bug, a red herring as far as I am concerned, but I thought I'd mention it just in case.
Edit 1
I am noticing that there is some "null operator" thing going on: the operator attribute for these task instances shows up as null.
Edit 2
Is null a valid value for task instance state? Or is this an indicator that something is wrong?
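A quick way to check what Airflow itself considers valid states (sketch, assuming Airflow 1.10.x):

    from airflow.utils.state import State

    # State.NONE is literally the Python value None; it is the state a task
    # instance has before the scheduler has scheduled it.
    print(State.task_states)   # the full set of valid task instance states
    print(State.NONE is None)  # True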
Edit 3
More none stuff. Here are some bits from the task instance details page; lots of the attributes are none:
Task Instance Details
Dependencies Blocking Task From Getting Scheduled
Dependency: Unknown
Reason: All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
- The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
- This task instance already ran and had its state changed manually (e.g. cleared in the UI)
If this task instance does not start soon please contact your Airflow administrator for assistance.
Task Instance Attributes
Attribute Value
duration None
end_date None
is_premature False
job_id None
operator None
pid None
queued_dttm None
raw False
run_as_user None
start_date None
state None
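The dependency message above mentions several concurrency settings. Here is a quick sketch for dumping their effective values (assuming Airflow 1.10.x; note the actual config key for max active DAG runs is max_active_runs_per_dag):

    from airflow.configuration import conf

    for key in ("parallelism", "dag_concurrency",
                "max_active_runs_per_dag", "non_pooled_task_slot_count"):
        # Effective value after airflow.cfg and environment-variable overrides
        print(key, conf.getint("core", key))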
Update
I may finally be on to something...
After my nightmarish, marathon, stuck-in-twilight-zone troubleshooting session, I threw my hands up and resolved to use docker containers instead of running natively. It was just too weird. Things were just not making sense. I needed to move to docker so that the environment could be completely controlled and reproduced.
So I started working on the docker setup based on puckel/docker-airflow. This was no trivial task either, because I decided to use environment variables for all parameters and connections. Not all hooks parse connection URIs the same way, so you have to be careful and look at the code and do some trial and error.
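One way to sanity-check a URI before wiring it into an environment variable is to let Airflow's Connection model parse it. Sketch only, with a made-up URI:

    from airflow.models import Connection

    # See how Airflow splits the URI into conn_type, host, port, schema, login
    c = Connection(conn_id="my_postgres",
                   uri="postgres://user:secret@db.example.com:5432/mydb")
    print(c.conn_type, c.host, c.port, c.schema, c.login)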
Anyway, after all that, I finally got my docker setup working locally. But when I went to build the image on my EC2 instance, I found that the disk was full, and it was full in no small part due to airflow logs.
So, my new theory is that lack of disk space may have had something to do with this. I am not sure if I will be able to find a smoking gun in the logs, but I will look.
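A quick sanity check I should have run much earlier (sketch, assuming Python 3 and the default ~/airflow/logs location):

    import os
    import shutil

    total, used, free = shutil.disk_usage("/")
    print("free disk: %.1f GiB" % (free / 2**30))

    log_dir = os.path.expanduser("~/airflow/logs")  # default base_log_folder
    log_bytes = sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(log_dir)
        for f in files
    )
    print("airflow logs: %.1f GiB" % (log_bytes / 2**30))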
When Airflow evaluates your DAG file, it interprets datetime.now() as the current timestamp (i.e. NOT a time in the past) and decides that it's not ready to run. To properly trigger your DAG to run, make sure to use a fixed start_date in the past, and set catchup=False if you don't want to perform a backfill.
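For example (a minimal sketch; the dag_id and dates are made up):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG(
        dag_id="example_fixed_start_date",
        start_date=datetime(2019, 1, 1),  # fixed date in the past, not datetime.now()
        schedule_interval="@daily",
        catchup=False,                    # don't backfill every interval since 2019-01-01
    )

    noop = DummyOperator(task_id="noop", dag=dag)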
To rerun a task in Airflow, you clear the task status, which updates the max_tries and current task instance state values in the metastore. After the task reruns, the max_tries value updates to 0 and the current task instance state updates to None.
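Roughly the same thing can be done programmatically, which is handy when there are many runs to clear (sketch; the dag_id and date range are placeholders):

    from datetime import datetime
    from airflow.models import DagBag

    # Clear task instances in the date range so the scheduler re-runs them
    dag = DagBag().get_dag("my_dag")
    dag.clear(start_date=datetime(2019, 3, 1), end_date=datetime(2019, 3, 2))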
Click on the "select all" checkbox at the top of the list to select all of the queued tasks. Now, in the "Actions" menu, select "Clear" and apply it to all of the queued tasks. Confirm your choice to Clear the queued tasks. Airflow should immediately prepare to run the queued tasks.
This is no longer required. Airflow will now auto-align the start_date and the schedule, by using the start_date as the moment to start looking.
OK, I am closing this out and marking the presumptive root cause as the server running out of disk space.
There were a number of contributing factors:
- An INFO log message, "Harvesting DAG parsing results", was being emitted every second or two, which eventually resulted in a very large log file. Resolution: this is fixed by commit [AIRFLOW-3911] Change Harvesting DAG parsing results to DEBUG log level (#4729), which is in 1.10.3, but you can always fork and cherry-pick it if you are stuck on 1.10.2.
- An airflow.cfg left over from a previous version. Solution: when upgrading (or changing versions), temporarily move airflow.cfg aside so that a cfg compatible with the new version gets generated, then merge the two carefully. Another strategy is to rely only on environment variables, so that your config is always that of a fresh install and the only parameters in your environment variables are overrides and, possibly, connections.
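To make the environment-variable strategy concrete, this is the naming convention Airflow uses (the values below are made-up examples, and in practice you would export them from your shell or docker-compose file rather than from Python):

    import os

    # AIRFLOW__<SECTION>__<KEY> overrides the corresponding airflow.cfg value;
    # AIRFLOW_CONN_<CONN_ID> defines a connection as a URI.
    os.environ["AIRFLOW__CORE__LOGGING_LEVEL"] = "WARN"  # quieter than the default INFO
    os.environ["AIRFLOW__CORE__PARALLELISM"] = "32"
    os.environ["AIRFLOW_CONN_MY_POSTGRES"] = "postgres://user:secret@db.example.com:5432/mydb"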