I am aiming to have a CloudWatch alarm triggered when a message from an SQS queue to a Lambda function exceeds the maximum number of retries.
I presumed this would be easy and that the NumberOfMessagesReceived metric would reflect it; those familiar with SQS will know that it does not.
My quick and easy solution for this problem was to introduce a "Limbo" queue, which acts as the first DLQ and within seconds pushes the message to the final/actual DLQ. In the metrics this results in a spike in the "Limbo" queue's visible messages metric, so an alarm threshold of "> 0" means that every time that queue receives a message an alert can be issued.
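For illustration, a minimal CloudFormation sketch of that alarm on the Limbo queue could look like the following (the alarm and queue names here are placeholders, not the real ones from my stack):

LimboAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: "LIMBO_ALARM"          # placeholder name
    Namespace: AWS/SQS
    MetricName: ApproximateNumberOfMessagesVisible
    Dimensions:
      - Name: QueueName
        Value: "Limbo_queue_name"     # placeholder for the Limbo queue's name
    Statistic: Maximum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 0                      # any visible message trips the alarm
    ComparisonOperator: GreaterThanThreshold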
However, the powers above me are not happy with having a "Limbo" queue every time we want this functionality.
As far as I have been able to figure out, there are some alternative methods, but these seem worse than the Limbo solution.
The first is to have a new Lambda function that uses the SQS DLQ as an event source and generates the alert (a rough sketch of this wiring follows the alternatives below).
The second is to have logic inside the existing Lambdas (that process SQS messages) read the number of times a message has been retried and, on the final attempt, generate the alert. This rather defeats the point of using a queue and a redrive policy in the first place, and is an over-engineered solution.
The last alternative I can think of is to use some Metric Math to look at the DLQ and calculate whether there has been an increase in the last X minutes.
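For the first alternative, I imagine the wiring would be something like the sketch below, where the DLQ is the Lambda's event source and the function does nothing but publish the alert; DeadletterQueue and DlqNotifierFunction are placeholder logical IDs of mine, not resources we actually have:

DlqNotifierEventSource:
  Type: AWS::Lambda::EventSourceMapping
  Properties:
    # Placeholder logical IDs: the DLQ feeds a small notifier Lambda that
    # would publish the alert (e.g. to an SNS topic).
    EventSourceArn:
      Fn::GetAtt:
        - "DeadletterQueue"
        - "Arn"
    FunctionName:
      Ref: "DlqNotifierFunction"
    BatchSize: 1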
These all seem like strange and overly complex solutions to something that (I am convinced) must have a simple implementation. How do I create an alert every time a DLQ receives a message?
How can I monitor and log dead-letter queues? You can use Amazon CloudWatch metrics to monitor dead-letter queues associated with your Amazon SNS subscriptions. All Amazon SQS queues emit CloudWatch metrics at one-minute intervals.
Dead-letter queues are also used at the sending end of a channel, for data-conversion errors. Every queue manager in a network typically has a local queue to be used as a dead-letter queue, so that messages that cannot be delivered to their correct destination can be stored for later retrieval.
When the ReceiveCount for a message exceeds the maxReceiveCount for a queue, Amazon SQS moves the message to a dead-letter queue (with its original message ID).
The purpose of the dead-letter queue is to hold messages that can't be delivered to any receiver, or messages that couldn't be processed. Messages can then be removed from the DLQ and inspected.
A dead-letter queue is an Amazon SQS queue that an Amazon SNS subscription can target for messages that can't be delivered to subscribers successfully. Messages that can't be delivered due to client errors or server errors are held in the dead-letter queue for further analysis or reprocessing.
For a FIFO topic, use an Amazon SQS FIFO queue as a dead-letter queue for the Amazon SNS subscription. To use an encrypted Amazon SQS queue as a dead-letter queue, you must use a custom CMK with a key policy that grants the Amazon SNS service principal access to AWS KMS API actions.
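For context, the move described above is driven by a redrive policy configured on the source queue. A minimal CloudFormation sketch (queue names and the maxReceiveCount value are placeholders) could look like:

SourceQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: SourceQueue            # placeholder name
    RedrivePolicy:
      deadLetterTargetArn:
        Fn::GetAtt:
          - "DeadletterQueue"
          - "Arn"
      maxReceiveCount: 3              # placeholder; after 3 failed receives the message moves to the DLQ
DeadletterQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: DeadletterQueue        # placeholder name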
I came across this same issue and had success implementing it using Metric Math. CloudWatch has a RATE() function which:
"Returns the rate of change of the metric per second. This is calculated as the difference between the latest data point value and the previous data point value, divided by the time difference in seconds between the two values."
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
So I created an alarm which looks at the rate of change of the ApproximateNumberOfMessagesVisible metric on the dead-letter queue. It goes into alarm when the rate of change is greater than 0. Here is a CloudFormation template example for the alarm:
DeadletterAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: "DEADLETTER_ALARM"
    ComparisonOperator: GreaterThanThreshold
    EvaluationPeriods: 1
    TreatMissingData: missing
    Threshold: '0'
    Metrics:
      - Id: r1
        Expression: RATE(FILL(m1, 0))
        ReturnData: true
      - Id: m1
        Label: VisibleAverage
        ReturnData: false
        MetricStat:
          Stat: Average
          Period: '300'
          Metric:
            MetricName: ApproximateNumberOfMessagesVisible
            Namespace: AWS/SQS
            Dimensions:
              - Name: QueueName
                Value: "Deadletter_queue_name"
One other way to accomplish this is to alarm on ApproximateNumberOfMessagesDelayed. Then you just need to set a delay on your DLQ. So it could look something like this:
MyDLQAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: MyDLQAlarm
    AlarmDescription: "Alarm when we have 1 or more failed messages in 10 minutes for MyQueue."
    Namespace: "AWS/SQS"
    MetricName: "ApproximateNumberOfMessagesDelayed"
    Dimensions:
      - Name: "QueueName"
        Value:
          # Watch the DLQ's metric, since the delay is applied on the DLQ
          Fn::GetAtt:
            - "MyQueueDLQ"
            - "QueueName"
    Statistic: "Sum"
    Period: 300
    DatapointsToAlarm: 1
    EvaluationPeriods: 2
    Threshold: 1
    ComparisonOperator: "GreaterThanOrEqualToThreshold"
    AlarmActions:
      - Ref: "SNSTopic"
Then your DLQ can look like:
MyQueueDLQ:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: MyQueueDLQ
    MessageRetentionPeriod: 1209600
    DelaySeconds: 60