AWX all jobs stop processing and hang indefinitely -- why

Problem

We've had a working Ansible AWX instance running on v5.0.0 for over a year, and suddenly all jobs have stopped working -- no output is rendered. Jobs enter the "running" state but hang indefinitely without printing any log output.

The AWX instance is running in a docker compose container setup as defined here: https://github.com/ansible/awx/blob/5.0.0/INSTALL.md#docker-compose

Observations

Standard troubleshooting, such as restarting the containers and the host OS, hasn't helped. There have been no configuration changes in either environment.

When debugging an actual playbook run, we see that the command AWX uses to run a playbook from the UI looks like this:

ssh-agent sh -c ssh-add /tmp/awx_11021_0fmwm5uz/artifacts/11021/ssh_key_data && rm -f /tmp/awx_11021_0fmwm5uz/artifacts/11021/ssh_key_data && ansible-playbook -vvvvv -u ubuntu --become --ask-vault-pass -i /tmp/awx_11021_0fmwm5uz/tmppo7rcdqn -e @/tmp/awx_11021_0fmwm5uz/env/extravars playbook.yml

That's broken down into three commands in sequence:

  1. ssh-agent sh -c ssh-add /tmp/awx_11021_0fmwm5uz/artifacts/11021/ssh_key_data
  2. rm -f /tmp/awx_11021_0fmwm5uz/artifacts/11021/ssh_key_data
  3. ansible-playbook -vvvvv -u ubuntu --become --ask-vault-pass -i /tmp/awx_11021_0fmwm5uz/tmppo7rcdqn -e @/tmp/awx_11021_0fmwm5uz/env/extravars playbook.yml

You can see in part 3 that -vvvvv is the debugging argument -- however, the hang happens on command #1, which has nothing to do with Ansible or AWX specifically, so the extra verbosity is not going to get us much debugging info.

I tried doing an strace to see what is going on, but for reasons given below, it is pretty difficult to follow what it is actually hanging on. I can provide this output if it might help.

Analysis

So one natural question with command #1 -- what is 'ssh_key_data'?

Well, it's what we set up as the Machine credential in AWX (an SSH key) -- it hasn't changed in a while, and it works just fine when used in a direct SSH command. It's also apparently being set up by AWX as a named pipe (FIFO), as the leading 'p' in the listing shows:

prw------- 1 root root 0 Dec 10 08:29 ssh_key_data

That starts to explain why it could be hanging: a reader opening a FIFO blocks until something on the other side of the pipe opens it for writing, so ssh-add would wait forever if nothing is being sent.
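To make the blocking behaviour concrete, here is a minimal sketch that reproduces it outside of AWX. It assumes a Linux host with GNU coreutils `timeout`; the temp directory, the `key_pipe` name, and the `read_fifo_with_timeout` helper are made up for illustration and are not what AWX uses internally.

```shell
#!/bin/sh
# Demonstrates that a reader blocks on a FIFO until a writer opens it.
# All paths are throwaway temp files, not AWX's real paths.

# Try to read a FIFO, giving up after $2 seconds.
# Prints the data that was read, or "timeout" if no writer showed up in time.
read_fifo_with_timeout() {
    fifo=$1
    secs=$2
    if out=$(timeout "$secs" cat "$fifo"); then
        printf '%s\n' "$out"
    else
        echo "timeout"
    fi
}

workdir=$(mktemp -d)
mkfifo "$workdir/key_pipe"

# Case 1: nobody writes, so the reader hangs until timeout kills it.
no_writer=$(read_fifo_with_timeout "$workdir/key_pipe" 1)

# Case 2: a background writer opens the pipe, so the reader returns promptly.
( printf 'fake-key-material' > "$workdir/key_pipe" ) &
with_writer=$(read_fifo_with_timeout "$workdir/key_pipe" 5)
wait

echo "no writer:   $no_writer"
echo "with writer: $with_writer"
rm -rf "$workdir"
```

Wrapping the hanging ssh-add step in `timeout` the same way would at least turn an indefinite hang into a visible failure while debugging.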

Running a normal ansible-playbook from command line (and supplying the SSH key in a more normal way) works just fine, so we can still deploy, but only via CLI right now -- it's just AWX that is broken.

Conclusions

So the question then becomes "why now?" and "how do we debug this?" I have checked the health of awx_postgres and verified that the Machine credential is present in the expected format (in the main_credential table). I have also verified that I can use ssh-agent on the awx_task container without that pipe keyfile. So it really seems to be this piped file that is the problem -- but I haven't been able to glean from any logs where the other side of the pipe (the sender) is supposed to be, or why it isn't sending the data.
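One way to look for the other side of the pipe is to scan /proc for processes that currently hold the FIFO open. The sketch below is Linux-only and the `pids_holding` helper is a hypothetical name, not an AWX tool. Note the caveat: a process that is still blocked inside open() on the FIFO has no file descriptor for it yet, so it will not be listed, and an empty result does not prove that no sender exists. Run it as root inside the awx_task container to see other users' processes.

```shell
#!/bin/sh
# Print the PIDs of processes that hold the given path open,
# found by comparing /proc/<pid>/fd symlink targets with the path.
pids_holding() {
    target=$(readlink -f "$1")
    for fd in /proc/[0-9]*/fd/*; do
        if [ "$(readlink "$fd" 2>/dev/null)" = "$target" ]; then
            pid=${fd#/proc/}      # strip the leading "/proc/"
            echo "${pid%%/*}"     # keep only the PID component
        fi
    done | sort -un
}

# Demo on a throwaway FIFO (not the real AWX path).
demo_dir=$(mktemp -d)
mkfifo "$demo_dir/ssh_key_data"
# Opening read-write does not block on Linux, so this shell becomes a holder.
exec 3<>"$demo_dir/ssh_key_data"
holders=$(pids_holding "$demo_dir/ssh_key_data")
exec 3>&-
rm -rf "$demo_dir"

echo "PIDs holding the demo pipe:"
echo "$holders"
```

Against the real job directory, the call would be `pids_holding /tmp/awx_11021_0fmwm5uz/artifacts/11021/ssh_key_data` while a job is hung.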

Jon asked Dec 12 '21 02:12

3 Answers

We had the same issue starting this Friday, in the same timeframe as you. It turned out that the CrowdStrike (Falcon sensor) agent was the culprit. I'm guessing they pushed a definition update that breaks or blocks FIFO pipes. When we stopped the CrowdStrike agent, AWX started working correctly again, with no issues. See if you are running a similar security product.
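A quick way to check whether such an agent is running on the Docker host is to look for its process by name. This is only a sketch: it assumes `pgrep` (procps) is installed and that the agent's process is called `falcon-sensor`, which is the usual name on Linux but may differ in your install. The `is_running` helper is made up for illustration.

```shell
#!/bin/sh
# Return success if a process with exactly this name is running.
is_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

# "falcon-sensor" is an assumed process name; add other agents to the list as needed.
for name in falcon-sensor; do
    if is_running "$name"; then
        echo "$name: running"
    else
        echo "$name: not running"
    fi
done
```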

Daniel Queen answered Oct 18 '22 19:10


For users of CrowdStrike, the problem is likely related to a policy change implemented by your organization over the weekend:

CrowdStrike released version 6.32, which many organizations adopted over the weekend in response to the Log4j vulnerability, and which introduced some changes around script-level inspection.

Script-Based Execution Monitoring is the cause of the disruption. As other users have said, you can disable CrowdStrike entirely and restart AWX jobs to get it working, but for security in production that may not be appropriate.

Instead, contact your CrowdStrike administrator, who will have updated the policy of your instance profile to include Script-Based Execution Monitoring. The policy management GUI has a checkbox that enables or disables this feature (new in 6.32). Ask them to disable it and send logs to the vendor.

user3888177 answered Oct 18 '22 20:10


Confirmed: the CrowdStrike policy update was why Ansible Tower stopped working for 48 hours at our company as well. Disabling the monitoring option allowed jobs to run successfully almost instantly.

Nick Hopkins answered Oct 18 '22 20:10