
Ansible **sporadically** fails with host unreachable - Failed to connect to the host via ssh

We are using Ansible to provision multiple nodes as a cluster. The machines are instances created on a custom, AWS-like infrastructure. We have about a hundred tasks spread across different playbooks, and they are executed on each node.

The problem is, we are getting sporadic host unreachable errors and playbook execution stops with the following failure:

TASK [common : install basic packages] *************************
fatal: [fqdn.for.a.node]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh.", "unreachable": true}

Output with -vvv:

TASK [common : install basic packages] *******************************
task path: /jenkins/workspace/Cluster-Deployment/91/roles/common/tasks/install-basic-packages.yml:1
<fqdn.for.a.node> ESTABLISH SSH CONNECTION FOR USER: root
<fqdn.for.a.node> SSH: EXEC ssh -C -q -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'IdentityFile="id_rsa"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=600 -o ControlPath=/home/turkenh/.ansible/cp/ansible-ssh-%h-%p-%r fqdn.for.a.node '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1466523588.96-210828884892875 `" && echo ansible-tmp-1466523588.96-210828884892875="` echo $HOME/.ansible/tmp/ansible-tmp-1466523588.96-210828884892875 `" ) && sleep 0'"'"''
failed: [fqdn.for.a.node] (item=[u'unzip']) => {"item": ["unzip"], "msg": "Failed to connect to the host via ssh.", "unreachable": true}

Here is our ansible.cfg file:

[defaults]
forks = 50
sudo_flags=-i
nocows=1

# do not check host key while doing ssh
host_key_checking = False
# use openssh not paramiko
transport = ssh
private_key_file = id_rsa
remote_user = root

Please see our notes below:

  • When we try to ping that host with Ansible (using the Ansible ping module, not the ping shell command) right after the failure, it throws the same error, but if we wait for about a minute, we can ping it.

  • What we can say about our custom AWS-based infrastructure is that there may be sporadic connection issues from time to time, which do not last longer than, say, 1-2 minutes.

  • We tried setting the timeout parameter to a large value (600) in ansible.cfg, but it did not help.

  • We are provisioning Ubuntu, Red Hat and SUSE nodes; regardless of the OS, we get this error roughly 20% of the time.

  • The failure is not tied to the same or similar tasks in the playbook; it happens at random ones (sometimes in the setup module, sometimes in the package module, ...).

  • Our Ansible version is 2.1 (installed with pip); the workstation runs Ubuntu 14.04.

So what we need is a way to tell Ansible not to give up with a failure as soon as it sees a node as unreachable, but to wait for some time or retry n times before reporting it as unreachable. How can we do this?

asked Jun 21 '16 at 22:06 by turkenh



1 Answer

Formally answering your question: you can increase the number of SSH connection attempts by setting ansible_ssh_common_args="-o ConnectionAttempts=20" in your inventory file. Set it for the problem host, for a group of hosts, or for the built-in all group (e.g. in the group_vars/all.yml file).
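
As a rough sketch of the inventory approach, assuming an INI-style inventory (the file name, group name and host names below are hypothetical placeholders), the option could be applied to every host like this:

```ini
# hosts (hypothetical INI inventory; group and host names are placeholders)
[cluster]
node1.example.com
node2.example.com

# Applies to every host in the inventory.
# Use [cluster:vars] instead to limit it to a single group.
[all:vars]
ansible_ssh_common_args='-o ConnectionAttempts=20'
```

ConnectionAttempts is a standard OpenSSH client option: according to ssh_config(5), it sets the number of tries (one per second) that ssh makes before exiting, and it defaults to 1. Whether 20 attempts is enough to ride out an outage of a minute or two depends on how long each attempt takes, so the value may need tuning.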

There is also the ssh_args configuration option, but I prefer not to modify it because it overrides Ansible's default SSH arguments.
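
If you do choose ssh_args anyway, a sketch along these lines may work. It assumes that Ansible's defaults are the options visible in the -vvv output in the question (-C, ControlMaster=auto, ControlPersist=60s); verify the defaults for your Ansible version before relying on it:

```ini
# ansible.cfg
[ssh_connection]
# ssh_args replaces Ansible's default SSH arguments, so the assumed defaults
# (taken from the -vvv output in the question) are restated here before the
# extra option is appended.
ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s -o ConnectionAttempts=20
```

Restating the defaults by hand is what makes this approach fragile: if a later Ansible release changes them, this line will silently keep the old ones.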

answered Oct 05 '22 at 00:10 by Konstantin Suvorov