Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Splittling SQS Lambda batch into partial success/partial failure

The AWS SQS -> Lambda integration allows you to process incoming messages in a batch, where you configure the maximum number you can receive in a single batch. If you throw an exception during processing, to indicate failure, all the messages are not deleted from the incoming queue and can be picked up by another lambda for processing once the visibility timeout has passed.

Is there any way to keep the batch processing, for performance reasons, but allow some messages from the batch to succeed (and be deleted from the inbound queue) and only leave some of the batch un-deleted?

like image 880
matt freake Avatar asked May 21 '19 08:05

matt freake


People also ask

What happens when Lambda fails to process SQS message?

If a Lambda function throws an error, the Lambda service continues to process the failed message until: The message is processed without any error from the function, and the service deletes the message from the queue. The Message retention period is reached and SQS deletes the message from the queue.

Can a Lambda be triggered by multiple SQS?

A Lambda function can process items from multiple queues (using one Lambda event source for each queue). You can use the same queue with multiple Lambda functions.

What is a batch size in SQS Lambda?

Batch size – The number of records to send to the function in each batch. For a standard queue, this can be up to 10,000 records. For a FIFO queue, the maximum is 10.

How does SQS batching work?

When using SQS as a Lambda event source mapping, Lambda functions can be triggered with a batch of messages from SQS. If your function fails to process any message from the batch, the entire batch returns to your SQS queue, and your Lambda function will be triggered with the same batch again.


1 Answers

The problem with manually re-enqueueing the failed messages to the queue is that you can get into an infinite loop where those items perpetually fail and get re-enqueued and fail again. Since they are being resent to the queue their retry count gets reset every time which means they'll never fail out into a dead letter queue. You also lose the benefits of the visibility timeout. This is also bad for monitoring purposes since you'll never be able to know if you're in a bad state unless you go manually check your logs.

A better approach would be to manually delete the successful items and then throw an exception to fail the rest of the batch. The successful items will be removed from the queue, all the items that actually failed will hit their normal visibility timeout periods and retain their receive count values, and you'll be able to actually use and monitor a dead letter queue. This is also overall less work than the other approach.

Considerations

  • Only override the default behavior if there has been a partial batch failure. If all the items succeeded, let the default behavior take its course
  • Since you're tracking the failures of each queue item, you'll need to catch and log each exception as they come in so that you can see what's going on later
like image 139
cdzar Avatar answered Sep 21 '22 21:09

cdzar