
RabbitMQ cluster is not reconnecting after network failure

I have a RabbitMQ cluster with two nodes in production and the cluster is breaking with these error messages:

=ERROR REPORT==== 23-Dec-2011::04:21:34 ===
** Node rabbit@rabbitmq02 not responding **
** Removing (timedout) connection **

=INFO REPORT==== 23-Dec-2011::04:21:35 ===
node rabbit@rabbitmq02 lost 'rabbit'

=ERROR REPORT==== 23-Dec-2011::04:21:49 ===
Mnesia(rabbit@rabbitmq01): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rabbitmq02}

I tried to simulate the problem by killing the connection between the two nodes with "tcpkill". The cluster disconnected and, surprisingly, the two nodes did not try to reconnect!

When the cluster breaks, the HAProxy load balancer still marks both nodes as active and sends requests to both of them, although they are no longer in a cluster.

My questions:

  1. If the nodes are configured to work as a cluster, when I get a network failure, why aren't they trying to reconnect afterwards?

  2. How can I identify a broken cluster and shut down one of the nodes? I have consistency problems when the two nodes work separately.

Asked by Ranch, Dec 28 '11 09:12




1 Answer

RabbitMQ clusters do not work well on unreliable networks (the RabbitMQ documentation says so). When the network fails in a two-node cluster, each node thinks that it is the master and the only node in the cluster. Two master nodes do not reconnect automatically, because their states are not synchronized automatically. (Even with a RabbitMQ slave, no actual message synchronization happens: the slave just "catches up" as messages are consumed from the queue and new messages are added.)

To detect whether you have a broken cluster, run the command:

rabbitmqctl cluster_status

on each node in the cluster. If the cluster is broken, you'll see only one node, something like:

Cluster status of node rabbit@rabbitmq1 ...
[{nodes,[{disc,[rabbit@rabbitmq1]}]},{running_nodes,[rabbit@rabbitmq1]}]
...done.
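
The check above can be automated. Here is a minimal sketch in Python, assuming the output has the same Erlang-term layout as the sample shown (other RabbitMQ versions may print a different layout, so the regular expression is an assumption). The function names and the second node name, rabbit@rabbitmq2, are hypothetical.

```python
import re


def running_nodes(status_output):
    """Return the node names listed in the {running_nodes,[...]} term."""
    match = re.search(r"\{running_nodes,\[([^\]]*)\]\}", status_output)
    if match is None:
        return []
    # Node names may be quoted Erlang atoms, e.g. 'rabbit@my-host'.
    return [name.strip().strip("'")
            for name in match.group(1).split(",") if name.strip()]


def missing_nodes(status_output, expected_nodes):
    """Return the expected nodes that are not reported as running."""
    return sorted(set(expected_nodes) - set(running_nodes(status_output)))


# Sample output copied from the answer; in practice it would come from
# running `rabbitmqctl cluster_status` on the node.
sample = """Cluster status of node rabbit@rabbitmq1 ...
[{nodes,[{disc,[rabbit@rabbitmq1]}]},{running_nodes,[rabbit@rabbitmq1]}]
...done."""

# rabbit@rabbitmq2 is a hypothetical name for the second cluster member.
print(missing_nodes(sample, ["rabbit@rabbitmq1", "rabbit@rabbitmq2"]))
```

A non-empty result means the node does not see all expected members, which is the broken-cluster condition described above.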

In that case, run the following commands on one of the nodes from the original cluster, so that it joins the other master node (say rabbitmq1) in the cluster as a slave:

rabbitmqctl stop_app

rabbitmqctl reset

rabbitmqctl join_cluster rabbit@rabbitmq1

rabbitmqctl start_app

Finally, check the cluster status again. This time you should see both nodes.
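
The rejoin steps above could be wrapped in a small script so they always run in the same order. This is only a sketch: the helper names and the dry_run flag are hypothetical, and it assumes rabbitmqctl is on the PATH of the node being reset. It defaults to printing the commands rather than running them, because reset wipes the node's local state.

```python
import subprocess


def rejoin_commands(target_node):
    """Build the command sequence from the answer, in order."""
    return [
        ["rabbitmqctl", "stop_app"],
        ["rabbitmqctl", "reset"],  # discards this node's local state
        ["rabbitmqctl", "join_cluster", target_node],
        ["rabbitmqctl", "start_app"],
    ]


def rejoin(target_node, dry_run=True):
    """Print the commands, or run them and stop at the first failure."""
    for cmd in rejoin_commands(target_node):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Dry run: only prints the four commands for joining rabbit@rabbitmq1.
rejoin("rabbit@rabbitmq1")
```

Run it on the node that should become the slave, and pass dry_run=False only after confirming that this node's data can be discarded.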

Note: If your RabbitMQ nodes are in an HA configuration using a virtual IP (and the clients connect to RabbitMQ through this virtual IP), then the node that holds the virtual IP should be the one made the master.

Answered by Gur Kamal Singh Badal, Oct 16 '22 23:10