Preamble
Yet another "Airflow tasks not getting executed" question...
Everything was going more or less fine in my airflow experience up until this weekend when things really went downhill.
I have checked all the standard things, e.g. those outlined in this helpful post.
I have reset the whole instance multiple times trying to get it working properly but I am totally losing the battle here.
Environment
The problem
Here's what happens in my troubleshooting infinite loop / recurring nightmare.
Before I started having this trouble, after I cleared a task instance, it would always get picked up and executed again very quickly.
But now, clearing the task instance usually results in the task instance getting stuck in a cleared state. It just sits there.
Worse, if I try failing the DAG and all its task instances, and then manually triggering the DAG again, the task instances get created but stay in the 'none' state. Restarting the scheduler doesn't help.
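For what it's worth, one way to see exactly which task instances are sitting in that state is to query the metadata DB through Airflow's own ORM. This is just a sketch, assuming Airflow 1.10.x; "my_dag" is a placeholder dag_id:

    from airflow.models import TaskInstance
    from airflow.utils.db import provide_session

    @provide_session
    def find_stuck_tasks(dag_id, session=None):
        # Task instances that exist but have no state ('none' in the UI)
        tis = (
            session.query(TaskInstance)
            .filter(TaskInstance.dag_id == dag_id, TaskInstance.state.is_(None))
            .all()
        )
        for ti in tis:
            print(ti.task_id, ti.execution_date, ti.state)
        return tis

    find_stuck_tasks("my_dag")  # "my_dag" is a placeholder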
Other observation
This is probably a red herring, but one thing I have noticed only recently is that when I click on the icon representing the task instances stuck in the 'none' state, it takes me to a "Task Instances" list view with the wrong filter applied: the filter is set to "string equals null".
You need to switch it to "string empty yes" to get it to actually return the stuck task instances (presumably because an equality comparison against null never matches, while an "is empty" check does).
I am assuming this is just an unrelated UI bug, a red herring as far as I am concerned, but I thought I'd mention it just in case.
Edit 1
I am noticing that there is some "null operator" thing going on: the operator attribute for these task instances shows up as null.
Edit 2
Is null a valid value for task instance state? Or is this an indicator that something is wrong?
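A quick way to check what Airflow itself considers valid states (sketch, assuming Airflow 1.10.x):

    from airflow.utils.state import State

    # State.NONE is literally the Python value None; it is the state a task
    # instance has before the scheduler has scheduled it.
    print(State.task_states)   # the full set of valid task instance states
    print(State.NONE is None)  # True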
Edit 3
More none stuff. Here are some bits from the task instance details page; lots of the attributes are none:
Task Instance Details
Dependencies Blocking Task From Getting Scheduled
Dependency: Unknown
Reason: All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
- The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
- This task instance already ran and had its state changed manually (e.g. cleared in the UI)
If this task instance does not start soon please contact your Airflow administrator for assistance.
Task Instance Attributes
Attribute Value
duration None
end_date None
is_premature False
job_id None
operator None
pid None
queued_dttm None
raw False
run_as_user None
start_date None
state None
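The dependency message above mentions several concurrency settings. Here is a quick sketch for dumping their effective values (assuming Airflow 1.10.x; note the actual config key for max active DAG runs is max_active_runs_per_dag):

    from airflow.configuration import conf

    for key in ("parallelism", "dag_concurrency",
                "max_active_runs_per_dag", "non_pooled_task_slot_count"):
        # Effective value after airflow.cfg and environment-variable overrides
        print(key, conf.getint("core", key))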
Update
I may finally be on to something...
After my nightmarish, marathon, stuck-in-twilight-zone troubleshooting session, I threw my hands up and resolved to use docker containers instead of running natively. It was just too weird. Things were just not making sense. I needed to move to docker so that the environment could be completely controlled and reproduced.
So I started working on the docker setup based on puckel/docker-airflow. This was no trivial task either, because I decided to use environment variables for all parameters and connections. Not all hooks parse connection URIs the same way, so you have to be careful and look at the code and do some trial and error.
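One way to sanity-check a URI before wiring it into an environment variable is to let Airflow's Connection model parse it. Sketch only, with a made-up URI:

    from airflow.models import Connection

    # See how Airflow splits the URI into conn_type, host, port, schema, login
    c = Connection(conn_id="my_postgres",
                   uri="postgres://user:secret@db.example.com:5432/mydb")
    print(c.conn_type, c.host, c.port, c.schema, c.login)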
Anyway, after all that, I finally got my docker setup working locally. But when I went to build the image on my EC2 instance, I found that the disk was full, and it was full in no small part due to airflow logs.
So, my new theory is that lack of disk space may have had something to do with this. I am not sure if I will be able to find a smoking gun in the logs, but I will look.
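A quick sanity check I should have run much earlier (sketch, assuming Python 3 and the default ~/airflow/logs location):

    import os
    import shutil

    total, used, free = shutil.disk_usage("/")
    print("free disk: %.1f GiB" % (free / 2**30))

    log_dir = os.path.expanduser("~/airflow/logs")  # default base_log_folder
    log_bytes = sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(log_dir)
        for f in files
    )
    print("airflow logs: %.1f GiB" % (log_bytes / 2**30))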
When Airflow evaluates your DAG file, it interprets datetime.now() as the current timestamp (i.e. NOT a time in the past) and decides that it's not ready to run. To properly trigger your DAG to run, make sure to use a fixed start_date in the past, and set catchup=False if you don't want to perform a backfill.
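For example (a minimal sketch; the dag_id and dates are made up):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG(
        dag_id="example_fixed_start_date",
        start_date=datetime(2019, 1, 1),  # fixed date in the past, not datetime.now()
        schedule_interval="@daily",
        catchup=False,                    # don't backfill every interval since 2019-01-01
    )

    noop = DummyOperator(task_id="noop", dag=dag)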
To rerun a task in Airflow, you clear the task status, which updates the max_tries and current task instance state values in the metastore. After the task reruns, the max_tries value updates to 0 and the current task instance state updates to None.
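Roughly the same thing can be done programmatically, which is handy when there are many runs to clear (sketch; the dag_id and date range are placeholders):

    from datetime import datetime
    from airflow.models import DagBag

    # Clear task instances in the date range so the scheduler re-runs them
    dag = DagBag().get_dag("my_dag")
    dag.clear(start_date=datetime(2019, 3, 1), end_date=datetime(2019, 3, 2))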
Click on the "select all" checkbox at the top of the list to select all of the queued tasks. Now, in the "Actions" menu, select "Clear" and apply it to all of the queued tasks. Confirm your choice to Clear the queued tasks. Airflow should immediately prepare to run the queued tasks.
This is no longer required. Airflow will now auto-align the start_date and the schedule, by using the start_date as the moment to start looking.
OK, I am closing this out and marking the presumptive root cause as the server running out of disk space.
There were a number of contributing factors:
- An INFO log message, "Harvesting DAG parsing results", was being emitted every second or two, which eventually resulted in a very large log file. Resolution: this is fixed by commit [AIRFLOW-3911] Change Harvesting DAG parsing results to DEBUG log level (#4729), which is in 1.10.3, but you can always fork and cherry-pick it if you are stuck on 1.10.2.
- An airflow.cfg left over from a previous version. Solution: when upgrading (or changing versions), temporarily move airflow.cfg aside so that a cfg compatible with the new version gets generated, then merge the two carefully. Another strategy is to rely only on environment variables, so that your config is always that of a fresh install and the only parameters in your environment variables are overrides and, possibly, connections.
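To make the environment-variable strategy concrete, this is the naming convention Airflow uses (the values below are made-up examples, and in practice you would export them from your shell or docker-compose file rather than from Python):

    import os

    # AIRFLOW__<SECTION>__<KEY> overrides the corresponding airflow.cfg value;
    # AIRFLOW_CONN_<CONN_ID> defines a connection as a URI.
    os.environ["AIRFLOW__CORE__LOGGING_LEVEL"] = "WARN"  # quieter than the default INFO
    os.environ["AIRFLOW__CORE__PARALLELISM"] = "32"
    os.environ["AIRFLOW_CONN_MY_POSTGRES"] = "postgres://user:secret@db.example.com:5432/mydb"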