
How to generate an alert when a SQS message is sent to the dead letter queue?

Goal

I want a CloudWatch alarm to trigger when a message that an SQS queue delivers to a Lambda function exceeds its maximum number of retries.

Problem

I presumed that this would be easy and the NumberOfMessagesReceived metric would reflect this. Those familiar with this will know that this is not the case.

Solutions

The 'Limbo' Solution

My quick and easy solution for this problem was to introduce a "Limbo" queue, which acts as the first DLQ and within seconds pushes the message to the final/actual DLQ. In the metrics this results in a spike in the "Limbo" queue's visible messages metric, so with an alert threshold of "> 0" an alert can be issued every time that queue receives a message.

However, the powers above me are not happy with adding a "Limbo" queue every time we want this functionality.

Screenshot that shows the desired behaviour using the "Limbo" queue here

As far as I have been able to figure out there are some alternative methods but these seem worse than the Limbo Solution.

New Lambda Function

The first is to have a new Lambda function that uses an SQS DLQ as its event source and generates the alert.
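A minimal sketch of this option, assuming a Python Lambda subscribed to the DLQ through an SQS event source mapping. The names `build_alert` and `handler` and the `ALERT_TOPIC_ARN` environment variable are hypothetical, and the `publish` parameter exists only so the sketch can be run without AWS credentials.

```python
import json
import os


def build_alert(record):
    """Turn one SQS record from the DLQ into SNS publish arguments."""
    queue_name = record["eventSourceARN"].split(":")[-1]
    return {
        # SNS subjects are limited to 100 characters.
        "Subject": f"Message dead-lettered in {queue_name}"[:100],
        "Message": json.dumps(
            {"messageId": record["messageId"], "body": record["body"]}
        ),
    }


def handler(event, context, publish=None):
    """Lambda entry point: send one alert per dead-lettered message."""
    if publish is None:
        # Assumption: boto3 is importable, as in the AWS Lambda Python runtime.
        import boto3

        sns = boto3.client("sns")
        topic_arn = os.environ["ALERT_TOPIC_ARN"]  # hypothetical variable name

        def publish(**kwargs):
            sns.publish(TopicArn=topic_arn, **kwargs)

    alerts = [build_alert(record) for record in event["Records"]]
    for alert in alerts:
        publish(**alert)
    return len(alerts)
```

One consequence of this design: when the function returns successfully, Lambda deletes the batch from the DLQ, so the messages would have to be stored elsewhere if they need to be inspected later.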

Lambda Runtime Interception

The second is to have logic inside the existing Lambdas (that process SQS messages) read the number of times a message has been retried and, on the final attempt, generate the alert. This largely removes the advantage of using a queue and a redrive policy in the first place, and is an over-engineered solution.
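For illustration, a sketch of what that interception could look like in a Python handler. It relies on the `ApproximateReceiveCount` attribute that SQS includes with each record; `MAX_RECEIVE_COUNT`, `process`, and the `alert` callback are hypothetical names, and the constant is assumed to be kept in sync with the queue's `maxReceiveCount` by hand.

```python
MAX_RECEIVE_COUNT = 3  # assumption: must equal maxReceiveCount in the redrive policy


def is_final_attempt(record, max_receive_count=MAX_RECEIVE_COUNT):
    """Return True when this delivery is the last one before the DLQ."""
    receive_count = int(record["attributes"]["ApproximateReceiveCount"])
    return receive_count >= max_receive_count


def process(record):
    """Placeholder for the real business logic (hypothetical)."""
    raise RuntimeError("processing failed")


def handler(event, context, process=process, alert=print):
    for record in event["Records"]:
        try:
            process(record)
        except Exception:
            if is_final_attempt(record):
                alert(f"Message {record['messageId']} is about to be dead-lettered")
            # Re-raise so the message becomes visible again and the
            # redrive policy still applies.
            raise
```

The duplicated retry limit is the main weakness here: if the redrive policy changes and the constant does not, the alert fires on the wrong attempt or not at all.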

Metric Maths

The last alternative I can think of is to use some Metric Math to look at the DLQ and calculate whether there has been an increase in the last X minutes.

These all seem like strange and overly complex solutions to what (I am convinced) must have a simple implementation. How do I create an alert every time a DLQ receives a message?

Jay Cork asked Mar 20 '20 08:03


2 Answers

I came across this same issue and had success implementing it using Metric Math. CloudWatch has a RATE() function which:

"Returns the rate of change of the metric per second. This is calculated as the difference between the latest data point value and the previous data point value, divided by the time difference in seconds between the two values."

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
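To make the arithmetic concrete, here is a small Python imitation of what `RATE(FILL(m, 0))` computes. It is only a sketch: the `fill` and `rate` helpers and the sample depths are made up, and the datapoints are assumed to be evenly spaced, whereas CloudWatch uses the actual timestamps.

```python
def fill(values, default=0):
    """Mimic FILL(m, 0): replace missing datapoints (None) with a default."""
    return [default if v is None else v for v in values]


def rate(values, period_seconds):
    """Mimic RATE(): per-second change between consecutive datapoints."""
    return [(curr - prev) / period_seconds for prev, curr in zip(values, values[1:])]


# Hypothetical DLQ depth sampled every 300 s; None means no datapoint was reported.
depth = [None, 0, 2, 2, 1]
rates = rate(fill(depth), period_seconds=300)
in_alarm = [r > 0 for r in rates]
print(in_alarm)  # [False, True, False, False]
```

The rate is positive only in the period where the depth grew, so messages that simply sit in the DLQ do not keep the alarm firing. By the same arithmetic, a period in which one message arrives while another is removed nets out to zero and would go unnoticed.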

So I created an alarm which looks at the rate of change of the ApproximateNumberOfMessagesVisible metric on the dead-letter queue. It goes into alarm when the rate of change is greater than 0. Here is a CloudFormation template example for the alarm:

DeadletterAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: "DEADLETTER_ALARM"
    ComparisonOperator: GreaterThanThreshold
    EvaluationPeriods: 1
    TreatMissingData: missing
    Threshold: 0
    Metrics:
      # r1 is the expression the alarm evaluates: rate of change of the queue depth
      - Id: r1
        Expression: RATE(FILL(m1, 0))
        ReturnData: true
      # m1 only feeds r1 and is not evaluated on its own
      - Id: m1
        Label: VisibleAverage
        ReturnData: false
        MetricStat:
          Stat: Average
          Period: 300
          Metric:
            MetricName: ApproximateNumberOfMessagesVisible
            Namespace: AWS/SQS
            Dimensions:
              - Name: QueueName
                Value: "Deadletter_queue_name"
Curtis H answered Oct 27 '22 04:10


One other way to accomplish this is to alarm on ApproximateNumberOfMessagesDelayed. Then you just need to set a delay on your DLQ. So it could look something like this:

MyDLQAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: MyDLQAlarm
    AlarmDescription: "Alarm when we have 1 or more failed messages in 10 minutes for MyQueue."
    Namespace: "AWS/SQS"
    MetricName: "ApproximateNumberOfMessagesDelayed"
    Dimensions:
      - Name: "QueueName"
        Value:
          # Watch the DLQ, since that is the queue with the delay set
          Fn::GetAtt:
            - "MyQueueDLQ"
            - "QueueName"
    Statistic: "Sum"
    Period: 300
    DatapointsToAlarm: 1
    EvaluationPeriods: 2
    Threshold: 1
    ComparisonOperator: "GreaterThanOrEqualToThreshold"
    AlarmActions:
      - Ref: "SNSTopic"

Then your DLQ can look like:

MyQueueDLQ:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: MyQueueDLQ
    MessageRetentionPeriod: 1209600
    DelaySeconds: 60
Scott Sullivan answered Oct 27 '22 06:10