
One slow ActiveMQ consumer causing other consumers to be slow

I'm looking for help with a strange issue where a slow consumer on a queue causes all the other consumers on the same queue to start consuming messages only at 30 second intervals. That is, all consumers except the slow one stop consuming messages as fast as they can and instead wait for some magical 30s barrier before consuming.

The basic flow of my application goes like this:

  1. a number of producers place messages onto a single queue; messages can have different JMSXGroupIDs (a minimal producer sketch follows this list)
  2. a number of consumers listen to messages on that single queue
  3. as standard practice the JMSXGroupIDs get distributed across the consumers
  4. at some point one of the consumers becomes slow and can't process messages very quickly
  5. the slow consumer ends up filling its prefetch buffer on the broker and AMQ recognises that it is slow (default behaviour)
  6. at that point - or some 'random' but close time later - all consumers except the slow one start to only consume messages at the same 30s intervals
  7. if the slow consumer becomes fast again then things very quickly return to normal operation and the 30s barrier goes away
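
For reference, step 1 looks roughly like the minimal sketch below. The broker URL, queue name and group values are illustrative placeholders, not the real application's:

    import javax.jms.*;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class GroupedProducer {
        public static void main(String[] args) throws JMSException {
            // Hypothetical broker URL and queue name, for illustration only.
            ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = factory.createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createQueue("work.queue"));

            for (int i = 0; i < 100; i++) {
                TextMessage message = session.createTextMessage("payload-" + i);
                // JMSXGroupID pins every message of a group to a single consumer;
                // the broker distributes the groups across the consumers (step 3).
                message.setStringProperty("JMSXGroupID", "group-" + (i % 10));
                producer.send(message);
            }
            connection.close();
        }
    }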

I'm at a loss as to what could be causing this issue or how to fix it. Please help.

More background and findings

  • I've managed to reliably reproduce this issue on AMQ 5.8.0, 5.9.0 (where the issue was originally noticed) and 5.9.1, on fresh installs and existing ops-managed installs, and on different machines, some VMs and some not. All installs are Linux, across different distributions and Java versions.
  • It doesn't appear to be affected by anything prefetch related: changing the prefetch value from 1 to 10 to 1000 didn't stop the issue from happening (see the sketch after this list).
  • [red herring?] Enabling debug logs on the AMQ instance shows logs relating to the periodic check for messages that can be expired. The queue doesn't have an expiry policy, so I can only think that the scheduled expireMessagesPeriod task is just waking AMQ up in such a way that it then sends messages to the non-slow consumers.
  • If the 30s mode is entered, then left, then entered again, the seconds-past-the-minute times are always the same, for example 14s and 44s past the minute. This is true across all consumers and all machines hosting those consumers. Those barrier points do change after restarts of AMQ.
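
For completeness, the prefetch experiments mentioned above were along these lines; the broker URL is illustrative, and the values 1, 10 and 1000 are the ones tried:

    import org.apache.activemq.ActiveMQConnectionFactory;
    import org.apache.activemq.ActiveMQPrefetchPolicy;

    public class PrefetchExperiment {
        // Build a connection factory with an explicit queue prefetch size.
        public static ActiveMQConnectionFactory withQueuePrefetch(int prefetch) {
            ActiveMQConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://localhost:61616");
            ActiveMQPrefetchPolicy policy = new ActiveMQPrefetchPolicy();
            policy.setQueuePrefetch(prefetch); // tried 1, 10 and 1000; no change
            factory.setPrefetchPolicy(policy);
            return factory;
        }
    }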
asked May 22 '14 by Matt


People also ask

How do you handle a slow consumer?

A different approach to handling slow consumers is to drop the client identified as slow, later allow it to fetch a snapshot of the current state of the data set from the server, and then continue receiving updates from that point onward.
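
If dropping the client is acceptable, ActiveMQ ships a destination policy for exactly that. A minimal embedded-broker sketch (the policy classes are real, the rest of the setup is illustrative):

    import org.apache.activemq.broker.BrokerService;
    import org.apache.activemq.broker.region.policy.AbortSlowConsumerStrategy;
    import org.apache.activemq.broker.region.policy.PolicyEntry;
    import org.apache.activemq.broker.region.policy.PolicyMap;

    public class SlowConsumerBroker {
        public static void main(String[] args) throws Exception {
            BrokerService broker = new BrokerService();

            // Abort consumers the broker has flagged as slow; they can
            // reconnect later and resume from the current state.
            AbortSlowConsumerStrategy strategy = new AbortSlowConsumerStrategy();
            strategy.setAbortConnection(false); // close the consumer, not the whole connection

            PolicyEntry policy = new PolicyEntry();
            policy.setSlowConsumerStrategy(strategy);

            PolicyMap policyMap = new PolicyMap();
            policyMap.setDefaultEntry(policy);
            broker.setDestinationPolicy(policyMap);

            broker.start();
        }
    }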

How many messages can ActiveMQ handle?

ActiveMQ has a setting which limits the number of messages that can be browsed by a client. By default it's 400. This setting prevents QueueExplorer from reading all (or the top 1000, top 10000, etc.) messages from the queue.
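
That limit is the per-destination maxBrowsePageSize. Reusing the PolicyEntry from the sketch above, raising it is a one-liner (the value shown is an arbitrary example):

    // Continuing the PolicyEntry sketch above: raise the browse limit
    // from its documented default of 400.
    policy.setMaxBrowsePageSize(10000);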

Is ActiveMQ push or pull?

ActiveMQ will push as many messages to the consumer as fast as possible, where they will be queued for processing by an ActiveMQ Session. The maximum number of messages that ActiveMQ will push to a Consumer without the Consumer processing a message is set by the pre-fetch size.


1 Answer

While not strictly a solution to the problem, further investigation has uncovered the root cause of this issue.

TL;DR - It's known behaviour and won't be fixed before Apollo

More Details

Ultimately this is caused by the maxPageSize property and the fact that AMQ will only apply selection criteria to messages in memory. Generally these are message selectors (property = value), but in my case they are JMSXGroupID=>Consumer assignments.

As messages are received by the queue they get paged into memory and placed into a collection (named pagedInPendingDispatch in the source). To dispatch messages AMQ will scan through this list of messages and try to find a consumer that will accept it. That includes checking the group id, message selector and prefetch buffer space. For our use case we aren't using message selectors but we are using groups. If no consumer can take the message then it is left in the collection and will be checked again at the next tick.
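
Schematically the scan looks something like the sketch below. This is an illustration of the logic just described, not ActiveMQ's actual source; all the types and method names here are invented:

    import java.util.Iterator;
    import java.util.List;

    class DispatchSketch {
        interface Msg { }
        interface Consumer {
            boolean accepts(Msg m); // group assignment, selector match, prefetch space
            void dispatch(Msg m);
        }

        // One dispatch tick: try to hand each paged-in message to some consumer.
        static void dispatchTick(List<Msg> pagedInPendingDispatch, List<Consumer> consumers) {
            Iterator<Msg> it = pagedInPendingDispatch.iterator();
            while (it.hasNext()) {
                Msg msg = it.next();
                for (Consumer consumer : consumers) {
                    if (consumer.accepts(msg)) {
                        consumer.dispatch(msg);
                        it.remove();
                        break;
                    }
                }
                // No taker: the message stays in the collection until the next tick.
            }
        }
    }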

In order to stop the pagedInPendingDispatch collection from eating up all the available resources, there is a suggested limit to its size, configured via the maxPageSize property. This property isn't actually a maximum; it's more a hint as to whether, under normal conditions, newly arrived messages should be paged into memory or left on disk.
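
For context, here is where these knobs live when configuring a broker programmatically. The values shown are the documented defaults, and the assumption is that the affected deployments were running with them unchanged:

    import org.apache.activemq.broker.BrokerService;
    import org.apache.activemq.broker.region.policy.PolicyEntry;
    import org.apache.activemq.broker.region.policy.PolicyMap;

    public class PagingDefaultsBroker {
        public static void main(String[] args) throws Exception {
            BrokerService broker = new BrokerService();

            PolicyEntry policy = new PolicyEntry();
            policy.setMaxPageSize(200);            // default page-in hint
            policy.setExpireMessagesPeriod(30000); // default expiry-scan period, in ms

            PolicyMap policyMap = new PolicyMap();
            policyMap.setDefaultEntry(policy);
            broker.setDestinationPolicy(policyMap);

            broker.start();
        }
    }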

With these two pieces of information and a slow consumer, it turns out that eventually all the messages in the pagedInPendingDispatch collection end up being consumable only by the slow consumer, hence the collection effectively gets blocked and no other messages get dispatched. This explains why the slow consumer wasn't affected by the 30s interval: it already had maxPageSize messages awaiting delivery.

This doesn't explain why I was seeing the non-slow consumers receive messages every 30s, though. As it turns out, paging messages into memory has two modes, normal and forced. Normal mode follows the process outlined above, where the size of the collection is compared to the maxPageSize property; in forced mode, however, messages are always paged into memory. This mode exists to let you browse messages that aren't in memory, and as it happens it is also used by the expiry mechanism so that AMQ can expire messages that aren't in memory.

So what we have now is a collection of in-memory messages that are all targeted for dispatch to the same consumer, a consumer that won't accept them because it is slow or blocked, plus a backlog of messages awaiting delivery to all consumers. Every expireMessagesPeriod milliseconds a task runs that force-pages messages into memory to check whether they should be expired. This adds those messages to the paged-in collection, which now contains maxPageSize messages for the slow consumer and N more messages destined for any consumer. Those messages get delivered.
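
Worth noting: the documented default for expireMessagesPeriod is 30000 ms, which lines up with the 30s cadence observed in the question. Setting it to 0 (policy.setExpireMessagesPeriod(0) in the sketch above) disables the periodic expiry scan entirely; by the logic above, that would stop even the 30s drip-feed to the non-slow consumers rather than cure anything, so it is not a workaround.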

QED.

References

  • Ticket referring to this issue but for message selectors instead
  • Docs relating to the configuration properties
  • Somebody else with this issue but for selectors
answered Sep 28 '22 by Matt