We are seeing poor performance in our RabbitMQ clusters, even when they are idle. After installing the rabbitmq-top plugin, we see many processes with very high reductions/sec: 100k and more!
Questions:
- What does this metric mean?
- How can we control it?
- What might be causing such slowness without any errors?
Notes:
- Our clusters are running on Kubernetes 1.15.11
- We allocated 3 nodes, each with limits of 8 CPUs and 8 GB of memory, and set vm_memory_high_watermark to 7 GB. Actual usage is ~1.5 CPUs and 1 GB of RAM
- RabbitMQ 3.8.2. Erlang 22.1
- We don't have many consumers or producers; the slowness appears even in a fairly idle environment
- The rabbitmqctl status command is very slow to return details (sometimes 2 minutes) but does not show any errors
After some more investigation, we found that the root cause was a combination of two issues:
- The RabbitMQ (Erlang) runtime configuration in the Bitnami Helm chart assigns only a single scheduler by default. This is fine for a simple app with a few concurrent connections, but a production-grade deployment with thousands of connections needs many more schedulers. Bumping the count from 1 to 8 improved throughput dramatically.
- Our monitoring was hammering RabbitMQ with about 100 requests per second. Each request hits the aliveness-test endpoint, which opens a connection, declares a (non-mirrored) queue, publishes a message, and then consumes it. Disabling the monitoring reduced load dramatically: an 80-90% drop in CPU usage, with reductions/sec also dropping by about 90%.
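For reference, here is a sketch of how the scheduler count can be raised via the standard RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS environment variable; how that variable is plumbed through your chart's values is chart-specific, so treat this as an assumption to verify against your Bitnami chart version:

```shell
# Pass emulator flags to the Erlang VM: +S <total>:<online> schedulers.
# With an 8-CPU limit, 8 schedulers matches the available cores.
export RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+S 8:8"

# After a restart, verify the running node picked it up (should print 8):
rabbitmqctl eval 'erlang:system_info(schedulers_online).'
```

Note that oversubscribing schedulers well beyond the CPU limit can hurt rather than help; the runtime guide linked below covers the trade-offs.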
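As a lighter alternative to polling aliveness-test, basic health probes can use rabbitmq-diagnostics (shipped with RabbitMQ 3.8), which avoids creating a connection and a queue on every check. A sketch:

```shell
# Cheapest check: is the node's Erlang VM running and responding?
rabbitmq-diagnostics -q ping

# Slightly deeper: verify no local resource alarms (memory/disk) are in effect.
rabbitmq-diagnostics -q check_local_alarms
```

Reserving aliveness-test for occasional end-to-end verification, rather than running it many times per second, keeps the monitoring load negligible.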
References
Performance:
- https://www.rabbitmq.com/runtime.html#scheduling
- https://www.rabbitmq.com/blog/2020/06/04/how-to-run-benchmarks/
- https://www.rabbitmq.com/blog/2020/08/10/deploying-rabbitmq-to-kubernetes-whats-involved/
- https://www.rabbitmq.com/runtime.html#cpu-reduce-idle-usage
Monitoring:
- http://rabbitmq.1065348.n5.nabble.com/RabbitMQ-API-aliveness-test-td32723.html
- https://groups.google.com/forum/#!topic/rabbitmq-users/9pOeHlhQoHA
- https://www.rabbitmq.com/monitoring.html