
AWS SQS - Queue not delivering any messages until Visibility Timeout expires for one message

Tags: go, amazon-sqs

EDIT: Solved this one while I was writing it up :P -- I love that kind of solution. I figured I'd post it anyway; maybe someone else will hit the same problem and find my solution. I don't care about points/karma, etc. I had already written the whole thing up, so I figured I'd post it along with the solution.

I have an SQS FIFO queue. It uses a dead-letter queue. Here is how it was configured:

[Screenshot: FIFO queue configuration]
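
The screenshot isn't reproduced here, but as a rough sketch (queue names, region, account ID, and DLQ ARN are placeholders, not our real values), the relevant settings (5-minute visibility timeout, redrive to the dead-letter queue after 10 receives, content-based deduplication) amount to a CreateQueue call like this with the AWS SDK for Go (v1):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
	svc := sqs.New(session.Must(session.NewSession()))

	// FIFO queue names must end in ".fifo", and the dead-letter queue for a
	// FIFO queue must itself be a FIFO queue. The ARN below is a placeholder.
	_, err := svc.CreateQueue(&sqs.CreateQueueInput{
		QueueName: aws.String("events.fifo"),
		Attributes: map[string]*string{
			"FifoQueue":                 aws.String("true"),
			"ContentBasedDeduplication": aws.String("true"), // dedup by SHA-256 hash of the body
			"VisibilityTimeout":         aws.String("300"),  // 5 minutes
			"RedrivePolicy": aws.String(
				`{"deadLetterTargetArn":"arn:aws:sqs:us-east-1:123456789012:events-dlq.fifo","maxReceiveCount":"10"}`),
		},
	})
	if err != nil {
		log.Fatalf("create queue: %v", err)
	}
}
```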

I have a single producer microservice, and I have 10 ECS images that are running as consumers.

For business reasons, it is important that we process messages close to the time they are delivered to the queue.

We're using a fairly recent version of the AWS SDK for Go client package for both the producer and consumer code (if it matters, I can look up the exact version, but it is not terribly outdated).

I capture the logs for the producer so I know exactly when messages were put in the queue and what the messages were.

I capture aggregate logs for all the consumers, so I have a full view of all 10 consumers and when messages were received and processed.

Here's what I see under normal conditions looking at the logs:

  1. Message put in the queue at time x
  2. Message received by one of the 10 consumers at time x
  3. Message processed by consumer successfully
  4. Message deleted from queue by consumer at time x + (0-2 seconds)
  5. Repeat ad infinitum, for up to about 700 messages per day at various times throughout the day

But the problem I am seeing now is that some messages are not being processed in a timely manner. Occasionally we deliberately fail processing a message because of the state of the system for that message (e.g. a user is still logged in, so the consumer should back off and retry...which it does). The problem is that when a consumer fails a message, the queue stops delivering any other messages to any other consumers.

"Failure to process a message" here just means the message was received, but the consumer declared it a failure, so we log an error and do not delete it from the queue. The visibility timeout (here 5 minutes) then expires and the message is re-delivered to another consumer; this repeats up to 10 times, after which it goes to the dead-letter queue.

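For context, the consumer loop is roughly this shape. This is a simplified sketch with the AWS SDK for Go (v1), not our actual service code; the queue URL and the process function are stand-ins:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

const queueURL = "https://sqs.us-east-1.amazonaws.com/123456789012/events.fifo" // placeholder

func consume(svc *sqs.SQS, process func(body string) error) {
	for {
		out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: aws.Int64(1),
			WaitTimeSeconds:     aws.Int64(20), // long polling
		})
		if err != nil {
			log.Printf("receive error: %v", err)
			continue
		}
		for _, msg := range out.Messages {
			if err := process(aws.StringValue(msg.Body)); err != nil {
				// Failure: log and do NOT delete. The message becomes visible
				// again after the 5-minute visibility timeout and is retried;
				// after 10 receives the redrive policy moves it to the DLQ.
				log.Printf("processing failed, leaving for retry: %v", err)
				continue
			}
			// Success: delete the message so it is not redelivered.
			if _, err := svc.DeleteMessage(&sqs.DeleteMessageInput{
				QueueUrl:      aws.String(queueURL),
				ReceiptHandle: msg.ReceiptHandle,
			}); err != nil {
				log.Printf("delete error: %v", err)
			}
		}
	}
}

func main() {
	svc := sqs.New(session.Must(session.NewSession()))
	consume(svc, func(body string) error {
		log.Printf("processing: %s", body)
		return nil // the real handler returns an error to signal "back off and retry"
	})
}
```
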
After delving into the logs and analyzing them, here's what I'm seeing:

  1. Process begins like above (message produced, consumed, deleted).
  2. New message received at time x by consumer
  3. Consumer fails -- logs error and just returns (does not delete)
  4. Same message is received again at time x + 5m (visibility timeout)
  5. Consumer fails -- logs error and just returns (does not delete)
  6. Repeat up to 10x -- message goes to dead-letter queue
  7. A new message is received, but it is now 50 minutes late!
  8. Now all messages that were put in the queue between steps 2-7 are 50 minutes late (5-minute visibility timeout * 10 retries)

All the docs I've read tell me the queue should not behave this way, but I've verified it several times in our logs. Sadly, we don't have a paid AWS support plan, or I'd file a ticket with them. But just consider the fact that we have 10 separate consumers all reading from the same queue. They read only from this queue, and we don't use any other queues.

For de-duplication we are using content-based deduplication, i.e. the automated hash of the message body. Messages are small JSON documents.
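
Concretely, the send side looks roughly like this sketch (again AWS SDK for Go v1; the queue URL and the "events" group ID value are placeholders). Because content-based deduplication is enabled on the queue, SQS hashes the message body itself and no MessageDeduplicationId is set explicitly:

```go
package producer

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// sendEvent is a sketch of the producer call. With ContentBasedDeduplication
// enabled on the queue, SQS derives the deduplication ID from a SHA-256 hash
// of the message body, so no MessageDeduplicationId is passed here.
func sendEvent(svc *sqs.SQS, queueURL, body string) error {
	_, err := svc.SendMessage(&sqs.SendMessageInput{
		QueueUrl:       aws.String(queueURL),
		MessageBody:    aws.String(body),     // small JSON document
		MessageGroupId: aws.String("events"), // one fixed group ID for every message (see the answer below)
	})
	return err
}
```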

My expectation would be that if a single bad message causes a visibility timeout, the queue would still happily deliver any other messages it has available while there are available consumers.

Asked Mar 11 '18 by zenocon


1 Answer

OK, so it turns out I missed this little nugget of info about FIFO queues in the documentation:

https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html


When you receive a message with a message group ID, no more messages for the same message group ID are returned unless you delete the message or it becomes visible.

I was indeed using the same message group ID for every message; I hadn't given it a second thought. Just be aware: if you do that and any one of your messages fails to process, it will back up all the other messages in the queue until that message is finally dealt with. The solution for me was to change the message group ID; there is a business-logic ID I can postfix onto it that will work for me.
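
In code, the fix amounts to something like the sketch below; entityID stands in for whatever business-logic identifier fits the message, and "events-" is just a placeholder prefix:

```go
package producer

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// sendEvent now postfixes a business-logic identifier onto the group ID, so a
// stuck message only blocks other messages in the *same* group while unrelated
// groups keep flowing. Ordering is still guaranteed within each group.
func sendEvent(svc *sqs.SQS, queueURL, body, entityID string) error {
	_, err := svc.SendMessage(&sqs.SendMessageInput{
		QueueUrl:       aws.String(queueURL),
		MessageBody:    aws.String(body),
		MessageGroupId: aws.String("events-" + entityID), // was: aws.String("events")
	})
	return err
}
```

Note that this trades strict ordering across the whole queue for ordering within each group, which is what made it work for me.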

Answered Sep 28 '22 by zenocon