We are having MSMQ issues in a load-balanced, high-volume environment using NServiceBus.
Our environment looks as follows: one F5 distributes web traffic via round robin to 6 application servers. Each of these 6 servers uses Bus.Send to send to a single queue on a remote machine that resides on a cluster.
The event throughput during normal usage is approximately 5-10 events per second per server, so 30-60 events per second across the entire environment, depending on load.
The issue we're seeing is that 1 of the application boxes is able to send messages to the cluster queue, but the other 5 are not. Looking at the 5 boxes experiencing failure, the outgoing queue to the cluster is inactive.
There are also a large number of messages in the transactional dead-letter queue. When we purge that queue, the outgoing queue connects to the cluster, but the messages pile up as unacknowledged in the outgoing queue. The backlog keeps growing until the messages move into the transactional dead-letter queue again and the outgoing queue changes state to inactive.
Interestingly, when we perform this purge operation, a different box becomes the 'good box'. So we're pretty sure that the issue is not one bad box; rather, only 1 box at a time can reliably maintain a connection to the cluster queue.
Has anybody come across this before?
We have, and it was because of the issue described here: http://blogs.msdn.com/b/johnbreakwell/archive/2007/02/06/msmq-prefers-to-be-unique.aspx
Short version: every MSMQ installation is assigned a unique ID at install time. It is called the QMId and is located in the registry under
HKLM\Software\Microsoft\MSMQ\Parameters\Machine Cache\QMid
It is used as an identifier when sending to a remote receiver, which in turn uses it to send ACKs back to the correct sender. The receiver, in your case the cluster, maintains a cache that maps QMIds to IPs. Our problem was that several of our workers had the SAME QMId. This meant the cluster sent all ACKs for all messages from all the machines to the first machine that sent a message. At some point, or after certain operations such as a restart of the MSMQ Windows service, the cache expires and ANOTHER machine magically "works".
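To make the failure mode concrete, here is a toy model in Python of the behaviour described above. It is only a sketch under simplifying assumptions, not how MSMQ is actually implemented: the receiver's cache is a plain dictionary, the first sender seen for a QMId "wins" the cache entry, and the host names and QMId strings are made up for illustration.

```python
from collections import Counter


def simulate_ack_routing(sends):
    """Toy model: count which host receives the ACK for each message.

    `sends` is a list of (sender_host, qmid) pairs in arrival order.
    The receiver remembers only the first host seen for each QMId and
    sends every later ACK for that QMId to that same host.
    """
    cache = {}        # QMId -> host the receiver believes owns it
    acks = Counter()  # host -> number of ACKs delivered to it
    for host, qmid in sends:
        cache.setdefault(qmid, host)  # first sender for this QMId wins
        acks[cache[qmid]] += 1
    return acks


# Hypothetical hosts cloned from one image, so they share a QMId.
cloned = [("app1", "qm-a"), ("app2", "qm-a"), ("app3", "qm-a")]
print(simulate_ack_routing(cloned))

# The same hosts after each has been given its own QMId.
unique = [("app1", "qm-a"), ("app2", "qm-b"), ("app3", "qm-c")]
print(simulate_ack_routing(unique))
```

In the cloned case every ACK lands on app1, so app2 and app3 never see their messages acknowledged, which matches the symptom of unacknowledged messages building up in the outgoing queue on all but one box.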
So check your 6 servers and make sure no two of them share the same QMId. Ours had the same value because they were all ghosted from a Windows image that was taken after MSMQ was installed.
The fix is easy: reinstall the MSMQ feature on each machine to generate a new, unique QMId.
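If you have more than a handful of servers, comparing the values by eye gets tedious. Below is a small Python sketch of the comparison step only. It assumes you have already collected each server's QMId by hand (for example from regedit) into a dictionary; the function name, host names and QMId values are hypothetical and used purely for illustration.

```python
from collections import defaultdict


def find_duplicate_qmids(qmids_by_host):
    """Return {qmid: [hosts...]} for every QMId used by more than one host.

    `qmids_by_host` maps a host name to the QMId string read from that
    host's registry. Values are compared case-insensitively and with
    surrounding whitespace ignored.
    """
    hosts_by_qmid = defaultdict(list)
    for host, qmid in qmids_by_host.items():
        hosts_by_qmid[qmid.strip().lower()].append(host)
    return {
        qmid: sorted(hosts)
        for qmid, hosts in hosts_by_qmid.items()
        if len(hosts) > 1
    }


# Hypothetical values collected manually from three servers.
collected = {
    "app1": "1A2B3C4D-0000-0000-0000-000000000001",
    "app2": "1a2b3c4d-0000-0000-0000-000000000001",  # same ID as app1
    "app3": "1A2B3C4D-0000-0000-0000-000000000002",
}

for qmid, hosts in find_duplicate_qmids(collected).items():
    print(f"QMId {qmid} is shared by: {', '.join(hosts)}")
```

An empty result means every server has its own QMId; any entry in the result lists the machines that need a new one.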