Configure an SQS dead-letter queue to raise a CloudWatch alarm on receiving a message

I am working with a dead-letter queue in Amazon SQS. Whenever the queue receives a new message, I want it to raise a CloudWatch alarm. The problem is that I configured an alarm on the queue's number_of_messages_sent metric, but this metric doesn't work as expected for dead-letter queues, as mentioned in the Amazon SQS Dead-Letter Queues section of the Amazon Simple Queue Service documentation.

Some suggestions were to use number_of_messages_visible instead, but I am not sure how to configure this in an alarm. If I alarm when the value of this metric is > 0, that is not the same as getting a new message in the queue: if an old message is still there, the metric value will always be > 0. I could use some kind of math expression to get the delta in this metric over a defined period (say, a minute), but I am looking for a better solution.

Mayank Bajaj asked Feb 13 '20 15:02



6 Answers

I used the metric math function RATE to trigger an alarm whenever a message arrives in the dead-letter queue.

Select two metrics, ApproximateNumberOfMessagesVisible and ApproximateNumberOfMessagesNotVisible, for your dead-letter queue.

Configure the metric expression as RATE(m1+m2), set the threshold to 0, and select GreaterThanThreshold as the comparison operator.

m1+m2 is the total number of messages in the queue at a given time. Whenever a new message arrives in the queue, the rate of this expression goes above zero. That's how it works.
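
To make the reasoning concrete, here is a small simulation of the idea in plain Python. It assumes RATE behaves as the difference between consecutive datapoints divided by the seconds between them, and the sample values are made up for illustration. It does not call AWS.

```python
def rate(datapoints):
    """Per-second rate of change between consecutive (timestamp_seconds, value) pairs."""
    rates = []
    for (t_prev, v_prev), (t_cur, v_cur) in zip(datapoints, datapoints[1:]):
        rates.append((v_cur - v_prev) / (t_cur - t_prev))
    return rates


def alarm_states(rates, threshold=0.0):
    """True where the rate is strictly greater than the threshold (GreaterThanThreshold)."""
    return [r > threshold for r in rates]


# Hypothetical m1+m2 totals sampled once a minute: two old messages sit in the
# queue, a third arrives during the third minute, then nothing changes.
total_messages = [(0, 2), (60, 2), (120, 3), (180, 3)]

states = alarm_states(rate(total_messages))
print(states)
```

Only the interval in which the total grows breaches the threshold. Old messages that simply stay in the queue give a rate of zero, and a message leaving the queue gives a negative rate.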

Lokesh answered Oct 31 '22 04:10


Achieving exactly what the question asks is difficult. If the end goal of the CloudWatch alarm is to send an email or otherwise notify users when a message arrives in the DLQ, you can do something similar with SQS, SNS, and Lambda. Whenever you receive an email, you can then check in CloudWatch how the number of DLQ messages grows over time.

  1. Create an SQS DLQ for an existing queue.
  2. Create an SNS topic and add an email subscription to it.
  3. Create a small Lambda function that listens to the DLQ for incoming messages and forwards each new message to SNS. Since the SNS topic has an email subscription, you will get an email whenever a new message arrives in the queue. The trigger for the Lambda function is the SQS queue, with a batch size of 1.
#!/usr/bin/python3
import boto3


def lambda_handler(event, context):
    # With a batch size of 1, each invocation carries a single SQS record
    for record in event['Records']:
        send_request(record['body'])


def send_request(body):
    # Create SNS client
    sns = boto3.client('sns')

    # Publish the message body to the specified SNS topic
    response = sns.publish(
        TopicArn='YOUR_TOPIC_ARN',  # replace with your SNS topic ARN
        Message=body,
    )

    # Print out the response
    print(response)
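
To check the record handling locally before wiring up the trigger, a sketch like the following may help. The `extract_bodies` helper and the sample event are hypothetical. The event is a trimmed-down version of what Lambda passes for an SQS trigger, and real events carry more fields per record.

```python
def extract_bodies(event):
    """Return the body of every SQS record in a Lambda event."""
    return [record["body"] for record in event.get("Records", [])]


# Hypothetical, trimmed-down SQS event with a single record (batch size 1)
sample_event = {
    "Records": [
        {
            "messageId": "id-1",
            "body": "order 42 failed",
            "eventSource": "aws:sqs",
        }
    ]
}

bodies = extract_bodies(sample_event)
print(bodies)
```

Each returned body is what the handler above would pass to send_request.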
deepanmurugan answered Oct 31 '22 04:10


I struggled with the same problem, and the answer for me was to use NumberOfMessagesSent instead. I could then set my criteria for new messages that came in during my configured period of time. Here is what worked for me in CloudFormation.

Note that a new alarm notification is not raised while the alarm stays in the ALARM state because of constant failures. You can set up another alarm to catch those, for example one that fires when 100 errors occur in 1 hour, using the same method.

Updated: Because the NumberOfMessagesReceived and NumberOfMessagesSent metrics depend on how the message is queued, I devised a new solution for our needs using the ApproximateNumberOfMessagesDelayed metric, after adding a delay to the DLQ settings. If you add messages to the queue manually, NumberOfMessagesReceived will work. Otherwise, use ApproximateNumberOfMessagesDelayed after setting up a delay.

MyDeadLetterQueue:
    Type: AWS::SQS::Queue
    Properties:
      MessageRetentionPeriod: 1209600  # 14 days
      DelaySeconds: 60 #for alarms

DLQthresholdAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: "Alarm dlq messages when we have 1 or more failed messages in 10 minutes"
      Namespace: "AWS/SQS"
      MetricName: "ApproximateNumberOfMessagesDelayed"
      Dimensions:
        - Name: "QueueName"
          Value:
            Fn::GetAtt:
              - "MyDeadLetterQueue"
              - "QueueName"
      Statistic: "Sum"
      Period: 300  
      DatapointsToAlarm: 1 
      EvaluationPeriods: 2       
      Threshold: 1
      ComparisonOperator: "GreaterThanOrEqualToThreshold"
      AlarmActions:
        - !Ref MyAlarmTopic
alex2017 answered Oct 31 '22 05:10


We had the same issue and solved it by using two metrics and creating a math expression.

    ConsentQueue:
        Type: AWS::SQS::Queue
        Properties:
            QueueName: "queue"
            RedrivePolicy:
                deadLetterTargetArn:
                    Fn::GetAtt:
                        - "DLQ"
                        - "Arn"
                maxReceiveCount: 3 # after 3 tries the event will go to DLQ
            VisibilityTimeout: 65
    DLQ:
        Type: AWS::SQS::Queue
        Properties:
            QueueName: "DLQ"

    DLQAlarm:
        Type: AWS::CloudWatch::Alarm
        Properties:
            AlarmDescription: "SQS failed"
            AlarmName: "SQSAlarm"
            Metrics:
            - Expression: "m2-m1"
              Id: "e1"
              Label: "ChangeInAmountVisible"
              ReturnData: true
            - Id: "m1"
              Label: "MessagesVisibleMin"
              MetricStat:
                  Metric:
                      Dimensions:
                      - Name: QueueName
                        Value: !GetAtt DLQ.QueueName
                      MetricName: ApproximateNumberOfMessagesVisible
                      Namespace: "AWS/SQS"
                  Period: 300 # evaluate minimum over period of 5 min
                  Stat: Minimum
                  Unit: Count
              ReturnData: false
            - Id: "m2"
              Label: "MessagesVisibleMax"
              MetricStat:
                  Metric:
                      Dimensions:
                      - Name: QueueName
                        Value: !GetAtt DLQ.QueueName
                      MetricName: ApproximateNumberOfMessagesVisible
                      Namespace: "AWS/SQS"
                  Period: 300 # evaluate maximum over period of 5 min
                  Stat: Maximum
                  Unit: Count
              ReturnData: false
            ComparisonOperator: GreaterThanOrEqualToThreshold
            Threshold: 1
            DatapointsToAlarm: 1
            EvaluationPeriods: 1

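The period is important: the minimum and maximum are each evaluated over the whole period, so it has to be long enough for a newly arrived message to show up as a difference between them. (Figure: AWS math expression graph.)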
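
As a rough illustration of why Maximum minus Minimum over one period detects arrivals, here is a plain-Python simulation. It assumes one ApproximateNumberOfMessagesVisible sample per minute in a 5-minute period. The sample values are made up, and no AWS call is involved.

```python
def change_in_visible(samples):
    """Equivalent of the m2-m1 expression: Maximum minus Minimum over one period."""
    return max(samples) - min(samples)


def breaches(samples, threshold=1):
    """GreaterThanOrEqualToThreshold check against the alarm threshold."""
    return change_in_visible(samples) >= threshold


# Hypothetical per-minute samples for three 5-minute periods
quiet_period = [2, 2, 2, 2, 2]    # old messages only, nothing new
arrival_period = [2, 2, 3, 3, 3]  # one message arrives mid-period
drained_period = [3, 3, 2, 2, 2]  # one message is removed mid-period

results = [breaches(p) for p in (quiet_period, arrival_period, drained_period)]
print(results)
```

Note that in this model the expression also breaches when a message leaves the DLQ during the period, since the difference does not record the direction of the change.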

Lucasz answered Oct 31 '22 06:10


I encountered the same issue with CloudWatch alarms not firing when queue entries automatically flow into a DLQ, and I believe I have come up with a solution.

You need to set up the following:

  • Choose a time period; I used 5 minutes.
  • Add a metric from the SQS metrics for the DLQ you need, and select "ApproximateNumberOfMessagesVisible". Set the statistic to Maximum.
  • Duplicate that metric, and set the statistic to Minimum.
  • Add a new empty math expression defined as (the ID of the Maximum metric) - (the ID of the Minimum metric).
  • Make sure you tick only the new expression you created above, then click "Select metric".

This should now periodically check the change in the number of entries in the DLQ, regardless of how they got there, so we can get past problematic metrics such as NumberOfMessagesSent or NumberOfMessagesReceived.

UPDATE: I just realised that is the exact solution that Lucasz mentioned above, so consider this a confirmation that it works :)
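
The console steps could also be scripted. The sketch below builds the keyword arguments for boto3's CloudWatch `put_metric_alarm` call and mirrors the Maximum minus Minimum setup. The function names, alarm name, queue name, and topic ARN are placeholders chosen for illustration, and `create_alarm` is not run here because it needs AWS credentials.

```python
def build_dlq_alarm(queue_name, topic_arn, period=300):
    """Build put_metric_alarm arguments for a Maximum - Minimum alarm on a DLQ."""

    def visible(metric_id, stat):
        # One ApproximateNumberOfMessagesVisible query with the given statistic
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/SQS",
                    "MetricName": "ApproximateNumberOfMessagesVisible",
                    "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
                },
                "Period": period,
                "Stat": stat,
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{queue_name}-new-messages",  # hypothetical naming scheme
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": 1,
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "AlarmActions": [topic_arn],
        "Metrics": [
            # Only the expression returns data, matching the single ticked row
            {"Id": "e1", "Expression": "m2-m1",
             "Label": "ChangeInAmountVisible", "ReturnData": True},
            visible("m1", "Minimum"),
            visible("m2", "Maximum"),
        ],
    }


def create_alarm(params):
    """Send the alarm definition to CloudWatch (requires boto3 and credentials)."""
    import boto3  # imported lazily so building the parameters needs no AWS setup
    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder queue name and topic ARN
params = build_dlq_alarm("my-dlq", "arn:aws:sns:us-east-1:123456789012:alerts")
print(params["Metrics"][0]["Expression"])
```

Calling `create_alarm(params)` with valid credentials would create the alarm. Keeping the parameter building separate makes the definition easy to inspect first.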

Steely77 answered Oct 31 '22 06:10


Here is a working Terraform example of the RATE(m1+m2) approach mentioned above:

resource "aws_cloudwatch_metric_alarm" "dlq_alarm" {
  alarm_name                = "alarm_name"
  comparison_operator       = "GreaterThanThreshold"
  evaluation_periods        = "1"
  threshold                 = "0"
  alarm_description         = "desc"
  insufficient_data_actions = []
  alarm_actions = [aws_sns_topic.sns.arn]

  metric_query {
    id          = "e1"
    expression  = "RATE(m2+m1)"
    label       = "Error Rate"
    return_data = "true"
  }

  metric_query {
    id = "m1"

    metric {
      metric_name = "ApproximateNumberOfMessagesVisible"
      namespace   = "AWS/SQS"
      period      = "60"
      stat        = "Sum"
      unit        = "Count"

      dimensions = {
        QueueName = aws_sqs_queue.sqs-dlq.name
      }
    }
  }

  metric_query {
    id = "m2"

    metric {
      metric_name = "ApproximateNumberOfMessagesNotVisible"
      namespace   = "AWS/SQS"
      period      = "60"
      stat        = "Sum"
      unit        = "Count"

      dimensions = {
        QueueName = aws_sqs_queue.sqs-dlq.name
      }
    }
  }
}
Goyat Parmod answered Oct 31 '22 05:10