I have a RabbitMQ server that serves as a queue hub for one of my systems. Over the last week or so, its producers have come to a complete halt every few hours.
Following cantSleepNow's answer, I have increased the memory allocated to RabbitMQ to 90%. The server has a whopping 16GB of memory and the message count is not very high (millions per day), so that does not seem to be the problem.
From the command line:
sudo rabbitmqctl set_vm_memory_high_watermark 0.9
And with /etc/rabbitmq/rabbitmq.config:
[
  {rabbit,
    [
      {loopback_users, []},
      {vm_memory_high_watermark, 0.9}
    ]
  }
].
I use Python for all consumers and producers.
The producers are API servers that serve calls. Whenever a call arrives, a connection is opened, a message is sent and the connection is closed.
from kombu import Connection

def send_message_to_queue(host, port, queue_name, message):
    """Sends a single message to the queue."""
    with Connection('amqp://guest:guest@%s:%s//' % (host, port)) as conn:
        simple_queue = conn.SimpleQueue(name=queue_name, no_ack=True)
        simple_queue.put(message)
        simple_queue.close()
The consumers slightly differ from each other, but generally use the following pattern - opening a connection and waiting on it until a message arrives. The connection can stay open for long periods of time (say, days).
with Connection('amqp://whatever:whatever@whatever:whatever//') as conn:
    while True:
        queue = conn.SimpleQueue(queue_name)
        message = queue.get(block=True)
        message.ack()
This design caused no problems until about one week ago.
The web console shows that the consumers in 127.0.0.1 and 172.31.38.50 block the consumers from 172.31.38.50, 172.31.39.120, 172.31.41.38 and 172.31.41.38.
Just to be on the safe side, I checked the server load. As expected, the load average and CPU utilization metrics are low.
Why does RabbitMQ reach such a deadlock?
This is most likely caused by a memory leak in the management module for RabbitMQ 3.6.2. This has now been fixed in RabbitMQ 3.6.3, and is available here.
The issue itself is described here, but it is also discussed extensively on the RabbitMQ message boards; for example here and here. It has also been known to cause a lot of weird issues; a good example is the issue reported here.
As a temporary fix until the new version is released, you can either upgrade to the newest build, downgrade to 3.6.1, or completely disable the management module.
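If you go the route of disabling the management module, it is a single plugin command - a sketch assuming a standard package install with rabbitmq-plugins on the path:

sudo rabbitmq-plugins disable rabbitmq_management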
I'm writing this as an answer, partially because it may help and partially because it's too large to be a comment.
First, I'm sorry for missing this: message = queue.get(block=True). Also a disclaimer - I'm not familiar with Python nor the PIKA API.
AMQP's basic.get is actually synchronous, and you are setting block=True. As I said, I don't know what this means in PIKA, but in combination with constantly polling the queue it doesn't sound efficient. So it could be that, for whatever reason, the publisher gets denied a connection because queue access is blocked by the consumer. It actually fits perfectly with how you temporarily resolve the issue: stopping the consumers releases the lock for a few minutes, but then the blocking returns.
I'd recommend trying AMQP's basic.consume instead of basic.get. I don't know what the motivation for get is, but in most cases (my experience anyway) you should go with consume. Just to quote from the aforementioned link:
This method provides a direct access to the messages in a queue using a synchronous dialogue that is designed for specific types of application where synchronous functionality is more important than performance.
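For illustration, here is a minimal sketch of a push-based consumer using kombu's ConsumerMixin (which uses basic.consume under the hood). The broker URL and queue name are placeholders, and error handling is omitted:

from kombu import Connection, Queue
from kombu.mixins import ConsumerMixin

class Worker(ConsumerMixin):
    """Push-based consumer: the broker delivers messages as they arrive."""

    def __init__(self, connection, queue_name):
        self.connection = connection  # required by ConsumerMixin
        self.queue = Queue(queue_name)

    def get_consumers(self, Consumer, channel):
        # Register a basic.consume subscription on the queue.
        return [Consumer(queues=[self.queue], callbacks=[self.on_message])]

    def on_message(self, body, message):
        # Handle the payload, then acknowledge it.
        print('Received: %r' % (body,))
        message.ack()

with Connection('amqp://guest:guest@localhost:5672//') as conn:  # placeholder URL
    Worker(conn, 'my_queue').run()  # placeholder queue name

The loop blocks inside run() and dispatches each delivery to on_message, so there is no repeated basic.get polling from the client side.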
The RabbitMQ docs say the connection gets blocked when the broker is low on resources, but as you wrote, the load is quite low. Just to be safe, you may check memory consumption and free disk space.
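For example, rabbitmqctl status prints, among other things, the current memory usage and the free disk space limit (assuming you have command-line access to the broker):

sudo rabbitmqctl status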