
RabbitMQ - Error while waiting for Mnesia tables

I have installed RabbitMQ using a Helm chart on a Kubernetes cluster. The RabbitMQ pod keeps restarting. On inspecting the pod logs I get the error below:

2020-02-26 04:42:31.582 [warning] <0.314.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-02-26 04:42:31.582 [info] <0.314.0> Waiting for Mnesia tables for 30000 ms, 6 retries left

When I run kubectl describe pod I see the following:

Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-rabbitmq-0
    ReadOnly:   false
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rabbitmq-config
    Optional:  false
  healthchecks:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rabbitmq-healthchecks
    Optional:  false
  rabbitmq-token-w74kb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rabbitmq-token-w74kb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/arch=amd64
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                      From                                               Message
  ----     ------     ----                     ----                                               -------
  Warning  Unhealthy  3m27s (x878 over 7h21m)  kubelet, gke-analytics-default-pool-918f5943-w0t0  Readiness probe failed: Timeout: 70 seconds ...
Checking health of node [email protected] ...
Status of node [email protected] ...
Error:
{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :"$1", :_, :_}, [], [:"$1"]}]]}}
Error:
{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :"$1", :_, :_}, [], [:"$1"]}]]}}
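
For reference, the probe output above ("Checking health of node ...", "Status of node ...") looks like the output of the standard rabbitmqctl health commands, so the same failure can be reproduced by hand; a rough sketch, assuming the stuck pod is rabbitmq-0 (matching the data-rabbitmq-0 claim above):

kubectl exec rabbitmq-0 -- rabbitmqctl node_health_check
kubectl exec rabbitmq-0 -- rabbitmqctl status
kubectl exec rabbitmq-0 -- rabbitmqctl list_vhosts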

I have provisioned the above on a Kubernetes cluster on Google Cloud. I am not sure what specific situation caused it to start failing. I had to restart the pod, and it has been failing ever since.

What is the issue here?

asked Feb 26 '20 by jeril

1 Answer

TLDR

helm upgrade rabbitmq bitnami/rabbitmq --set clustering.forceBoot=true

Problem

The problem happens for the following reason:

  • All RMQ pods are terminated at the same time for some reason (maybe because you explicitly set the StatefulSet replicas to 0, or something else).
  • One of them is the last one to stop (maybe just a tiny bit after the others). It stores this condition ("I'm standalone now") in its filesystem, which in k8s is the PersistentVolume(Claim). Let's say this pod is rabbitmq-1.
  • When you spin the StatefulSet back up, the pod rabbitmq-0 is always the first to start (see here).
  • During startup, pod rabbitmq-0 first checks whether it's supposed to run standalone. But as far as it can see on its own filesystem, it's part of a cluster, so it waits for its peers (the "Waiting for Mnesia tables" messages in your logs) and never finds them. By default this results in a startup failure.
  • rabbitmq-0 thus never becomes ready.
  • rabbitmq-1 never starts, because that's how StatefulSets are deployed: one after another. If it were allowed to start, it would come up successfully, because it knows it may run standalone.

So in the end, it's a bit of a mismatch between how RabbitMQ and StatefulSets work. RMQ says: "if everything goes down, just start everything at the same time; one of them will be able to start, and as soon as it is up, the others can rejoin the cluster." k8s StatefulSets say: "starting everything all at once is not possible; we'll start with pod 0".
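
You can see both halves of that mismatch on the cluster itself; a rough sketch, assuming the StatefulSet is named rabbitmq (matching the pod and PVC names above) and that the chart labels its pods with app=rabbitmq:

# OrderedReady (the default) means pods are started strictly one after another
kubectl get statefulset rabbitmq -o jsonpath='{.spec.podManagementPolicy}'

# Typically shows rabbitmq-0 stuck at 0/1 Ready while the higher-numbered pods never appear
kubectl get pods -l app=rabbitmq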

Solution

To fix this, there is a force_boot command for rabbitmqctl which basically tells an instance to start standalone if it doesn't find any peers. How you can use this from Kubernetes depends on the Helm chart and container you're using. In the Bitnami Chart, which uses the Bitnami Docker image, there is a value clustering.forceBoot = true, which translates to an env variable RABBITMQ_FORCE_BOOT = yes in the container, which will then issue the above command for you.
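
A rough sketch of both ways to set that value; the release name rabbitmq and the bitnami/rabbitmq chart reference are assumptions based on the names in the question:

# One-off: re-render the StatefulSet with the flag enabled, keeping the other release settings
helm upgrade rabbitmq bitnami/rabbitmq --reuse-values --set clustering.forceBoot=true

# Or keep it in values.yaml so it survives future upgrades
clustering:
  forceBoot: true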

But looking at the problem, you can also see why deleting the PVCs works (as the other answer suggests): the pods simply all "forget" that they were part of an RMQ cluster the last time around and happily start. I would prefer the solution above, though, as no data is lost.
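
For completeness, that data-destroying alternative would look roughly like this, assuming a StatefulSet named rabbitmq whose PVCs follow the data-rabbitmq-N naming seen in the describe output (adjust the claim names and replica count to your cluster):

# WARNING: this wipes everything RabbitMQ stored on the volumes (queues, messages, users, vhosts)
kubectl scale statefulset rabbitmq --replicas=0
kubectl delete pvc data-rabbitmq-0 data-rabbitmq-1 data-rabbitmq-2
kubectl scale statefulset rabbitmq --replicas=3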

answered Oct 25 '22 by Ulli