 

NServiceBus failing while sending messages to an MSMQ cluster queue in a load-balanced environment

We are having MSMQ issues in a load-balanced, high-volume environment using NServiceBus.

Our environment looks as follows: one F5 distributes web traffic via round robin to six application servers. Each of these six servers uses Bus.Send to send to a single queue on a remote machine that resides on a cluster.

The event throughput during normal usage is approximately 5-10 per second, per server. So 30-60 events per second in the entire environment, depending on load.

The issue we're seeing is that 1 of the application boxes is able to send messages to the cluster queue, but the other 5 are not. Looking at the 5 boxes experiencing failure, the outgoing queue to the cluster is inactive.

There is also a high number of events in the transactional dead-letter queue. When we purge that queue, the outgoing queue connects to the cluster, but the messages then pile up as unacknowledged in the outgoing queue. The backlog keeps growing until the messages move into the transactional dead-letter queue again and the outgoing queue changes state to inactive.

Interestingly, when we perform this purge operation, a different box will become the 'good box'. So we're pretty sure that the issue is not one bad box, it's that only 1 box at a time can reliably maintain a connection to the cluster queue.

Has anybody come across this before?

darthjit asked Dec 26 '22 14:12



1 Answer

We have, and it was because of the issue described here: http://blogs.msdn.com/b/johnbreakwell/archive/2007/02/06/msmq-prefers-to-be-unique.aspx

Short version: every MSMQ installation has a unique ID assigned to it when you install MSMQ. It is called the QMId and is stored in the registry under

HKLM\Software\Microsoft\MSMQ\Parameters\Machine Cache\QMid

It is used as an identifier when sending to a remote receiver, which in turn uses it to send ACKs back to the correct sender. The receiver, in your case the cluster, maintains a cache that maps QMIds to IP addresses. Our problem was that several of our workers had the SAME QMId. This meant the cluster sent all ACKs for all messages from all the machines to the first machine that sent a message. At some point, or after certain operations such as a restart of the MSMQ Windows service, the cache expires and ANOTHER machine magically "works".
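To make the ACK routing above concrete, here is a toy model in Python. It is only a sketch of the behaviour as described, not MSMQ's actual implementation: the `ReceiverModel` class, the `simulate` helper, the QMId strings and the IP addresses are all made up for illustration.

```python
from collections import Counter


class ReceiverModel:
    """Toy stand-in for the receiver-side QMId -> IP cache (hypothetical)."""

    def __init__(self):
        self.address_cache = {}

    def receive(self, qmid, sender_ip):
        # The first sender seen for a QMId is cached; later senders that
        # present the same QMId do not replace the cached address.
        self.address_cache.setdefault(qmid, sender_ip)
        # Return the address the ACK would be routed to.
        return self.address_cache[qmid]


def simulate(senders, rounds=3):
    """Count how many ACKs each IP receives when every sender sends `rounds` messages."""
    receiver = ReceiverModel()
    acks = Counter()
    for _ in range(rounds):
        for qmid, ip in senders:
            acks[receiver.receive(qmid, ip)] += 1
    return dict(acks)


# Three machines cloned from one image share a QMId (made-up values).
cloned = [("qm-A", "10.0.0.1"), ("qm-A", "10.0.0.2"), ("qm-A", "10.0.0.3")]
# Three machines with distinct QMIds.
unique = [("qm-A", "10.0.0.1"), ("qm-B", "10.0.0.2"), ("qm-C", "10.0.0.3")]

print(simulate(cloned))  # {'10.0.0.1': 9}
print(simulate(unique))  # {'10.0.0.1': 3, '10.0.0.2': 3, '10.0.0.3': 3}
```

In the cloned case only the first sender ever sees its messages acknowledged, which matches the "one good box" symptom in the question.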

So check your six servers and make sure no two of them have the same QMId. Ours had the same value because they were all ghosted from a Windows image that was taken after MSMQ was installed.

The fix is easy: reinstall the MSMQ feature on each machine to generate a new, unique QMId.
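One way to do that check, sketched in Python under a few assumptions: `read_local_qmid` is a hypothetical helper that only works on Windows, takes the registry key path as an argument (use the one mentioned above as it appears on your machines), and assumes the QMId value can be read with the standard `winreg` module. `find_duplicate_qmids` is plain Python and works on values collected by any means; the host names and IDs in the usage note are invented.

```python
from collections import defaultdict


def read_local_qmid(subkey):
    """Read the QMId value from HKLM\\<subkey> on the local machine (Windows only)."""
    import winreg  # imported here so the rest of the module loads on any OS

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, subkey) as key:
        value, _value_type = winreg.QueryValueEx(key, "QMId")
    # Binary registry values come back as bytes; render them as hex text.
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).hex()
    return str(value)


def find_duplicate_qmids(host_to_qmid):
    """Return {qmid: [hosts...]} for every QMId used by more than one host."""
    hosts_by_qmid = defaultdict(list)
    for host, qmid in host_to_qmid.items():
        # Normalise case and whitespace so formatting differences do not hide a clash.
        hosts_by_qmid[qmid.strip().lower()].append(host)
    return {
        qmid: sorted(hosts)
        for qmid, hosts in hosts_by_qmid.items()
        if len(hosts) > 1
    }
```

For example, `find_duplicate_qmids({"app1": "AA11", "app2": "aa11", "app3": "bb22"})` reports that `app1` and `app2` share an ID. An empty result means every server has its own QMId.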

janovesk answered May 01 '23 16:05
