I was working with a dead-letter queue in Amazon SQS. I want a CloudWatch alarm to be raised whenever a new message is received by the queue. The problem is that I configured an alarm on the queue's NumberOfMessagesSent metric, but this metric doesn't work as expected for dead-letter queues, as mentioned in the Amazon SQS Dead-Letter Queues - Amazon Simple Queue Service documentation.
Some suggestions were to use ApproximateNumberOfMessagesVisible instead, but I am not sure how to configure this in an alarm. If I set the alarm to trigger when the value of this metric is > 0, that is not the same as a new message arriving in the queue: if an old message is still sitting there, the metric value will always be > 0. I could write some kind of metric math expression to get the delta in this metric over a defined period (say, a minute), but I am looking for a better solution.
Amazon SQS provides support for dead-letter queues. A dead-letter queue is a queue that other (source) queues can target for messages that can't be processed successfully. You can set aside and isolate these messages in the dead-letter queue to determine why their processing did not succeed.
To specify a dead-letter queue, you can use the console or the AWS SDK for Java. You must do this for each queue that sends messages to a dead-letter queue. Multiple queues of the same type can target a single dead-letter queue.
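For example, a minimal sketch of attaching an existing dead-letter queue to a source queue with boto3 (the queue URL, DLQ ARN and maxReceiveCount below are placeholder values to replace with your own):
import json
import boto3

sqs = boto3.client("sqs")

# Placeholders: replace with your source queue URL and the ARN of your DLQ.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-source-queue",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:my-dead-letter-queue",
            "maxReceiveCount": "5",  # move a message to the DLQ after 5 failed receives
        })
    },
)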
I used the metric math function RATE to trigger an alarm whenever a message arrives in the dead-letter queue.
Select the two metrics ApproximateNumberOfMessagesVisible and ApproximateNumberOfMessagesNotVisible for your dead-letter queue. Configure the metric expression as RATE(m1+m2), set the threshold to 0, and select GreaterThanThreshold as the comparison operator.
m1+m2 is the total number of messages in the queue at a given time. Whenever a new message arrives in the queue, the rate of this expression rises above zero. That's how it works.
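For reference, a minimal boto3 sketch of that same alarm (the queue name, alarm name and SNS topic ARN below are placeholders):
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholders: replace with your own DLQ name and SNS topic ARN.
DLQ_NAME = "my-dead-letter-queue"
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:my-alerts"

cloudwatch.put_metric_alarm(
    AlarmName="dlq-new-message-alarm",
    AlarmDescription="Fires when a new message lands in the DLQ",
    ComparisonOperator="GreaterThanThreshold",
    Threshold=0,
    EvaluationPeriods=1,
    AlarmActions=[TOPIC_ARN],
    Metrics=[
        {
            # RATE of the total message count: positive only when messages arrive
            "Id": "e1",
            "Expression": "RATE(m1+m2)",
            "Label": "DLQ message arrival rate",
            "ReturnData": True,
        },
        {
            "Id": "m1",
            "ReturnData": False,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/SQS",
                    "MetricName": "ApproximateNumberOfMessagesVisible",
                    "Dimensions": [{"Name": "QueueName", "Value": DLQ_NAME}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
        },
        {
            "Id": "m2",
            "ReturnData": False,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/SQS",
                    "MetricName": "ApproximateNumberOfMessagesNotVisible",
                    "Dimensions": [{"Name": "QueueName", "Value": DLQ_NAME}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
        },
    ],
)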
It is difficult to achieve exactly what is being asked in the question. If the goal of the CloudWatch alarm is to send an email or otherwise notify users when a message arrives in the DLQ, you can achieve a similar result with SQS, SNS and Lambda, and in CloudWatch you can still watch how the DLQ message count grows over time whenever you receive an email. The Lambda function below, triggered by the DLQ, publishes each incoming message body to an SNS topic:
#!/usr/bin/python3
import boto3

def lambda_handler(event, context):
    # The Lambda is triggered by the DLQ; each record is one dead-lettered message.
    for record in event['Records']:
        send_request(record["body"])

def send_request(body):
    # Create SNS client
    sns = boto3.client('sns')

    # Publish the message body to the specified SNS topic
    response = sns.publish(
        TopicArn="YOUR_TOPIC_ARN",  # replace with your SNS topic ARN
        Message=body,
    )

    # Print out the response
    print(response)
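For this to work, the DLQ has to be configured as an event source for the Lambda function. A minimal sketch with boto3 (the queue ARN and function name are placeholders):
import boto3

lambda_client = boto3.client("lambda")

# Placeholders: replace with the ARN of your DLQ and your Lambda function name.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:my-dead-letter-queue",
    FunctionName="dlq-to-sns-forwarder",
    BatchSize=1,  # invoke the function for each message individually
)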
I struggled with the same problem, and the answer for me was to use NumberOfMessagesSent instead. Then I could set my criteria for new messages that came in during my configured period of time. Here is what worked for me in CloudFormation.
Note that individual alarms do not fire if the alarm stays in the ALARM state because of constant failures. You can set up another alarm to catch those, i.e. an alarm when 100 errors occur in 1 hour, using the same method.
Updated: Because the metrics NumberOfMessagesReceived and NumberOfMessagesSent depend on how the message is queued, I have devised a new solution for our needs using the metric ApproximateNumberOfMessagesDelayed, after adding a delay to the DLQ settings. If you are adding messages to the queue manually, then NumberOfMessagesReceived will work. Otherwise, use ApproximateNumberOfMessagesDelayed after setting up a delay.
MyDeadLetterQueue:
  Type: AWS::SQS::Queue
  Properties:
    MessageRetentionPeriod: 1209600 # 14 days
    DelaySeconds: 60 # for alarms

DLQthresholdAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: "Alarm dlq messages when we have 1 or more failed messages in 10 minutes"
    Namespace: "AWS/SQS"
    MetricName: "ApproximateNumberOfMessagesDelayed"
    Dimensions:
      - Name: "QueueName"
        Value:
          Fn::GetAtt:
            - "MyDeadLetterQueue"
            - "QueueName"
    Statistic: "Sum"
    Period: 300
    DatapointsToAlarm: 1
    EvaluationPeriods: 2
    Threshold: 1
    ComparisonOperator: "GreaterThanOrEqualToThreshold"
    AlarmActions:
      - !Ref MyAlarmTopic
We had the same issue and solved it by using two metrics and creating a math expression.
ConsentQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: "queue"
    RedrivePolicy:
      deadLetterTargetArn:
        Fn::GetAtt:
          - "DLQ"
          - "Arn"
      maxReceiveCount: 3 # after 3 tries the event will go to DLQ
    VisibilityTimeout: 65

DLQ:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: "DLQ"

DLQAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: "SQS failed"
    AlarmName: "SQSAlarm"
    Metrics:
      - Expression: "m2-m1"
        Id: "e1"
        Label: "ChangeInAmountVisible"
        ReturnData: true
      - Id: "m1"
        Label: "MessagesVisibleMin"
        MetricStat:
          Metric:
            Dimensions:
              - Name: QueueName
                Value: !GetAtt DLQ.QueueName
            MetricName: ApproximateNumberOfMessagesVisible
            Namespace: "AWS/SQS"
          Period: 300 # evaluate minimum over period of 5 min
          Stat: Minimum
          Unit: Count
        ReturnData: false
      - Id: "m2"
        Label: "MessagesVisibleMax"
        MetricStat:
          Metric:
            Dimensions:
              - Name: QueueName
                Value: !GetAtt DLQ.QueueName
            MetricName: ApproximateNumberOfMessagesVisible
            Namespace: "AWS/SQS"
          Period: 300 # evaluate maximum over period of 5 min
          Stat: Maximum
          Unit: Count
        ReturnData: false
    ComparisonOperator: GreaterThanOrEqualToThreshold
    Threshold: 1
    DatapointsToAlarm: 1
    EvaluationPeriods: 1
The period is important so that the minimum and maximum are evaluated over a long enough window.
I've encountered the same issue with CloudWatch alarms not firing when queue entries automatically flow into a DLQ, and I believe I have come up with a solution.
You need to set up a metric math alarm on the DLQ like the one shown above. On a periodic basis it checks the change in the number of entries in the DLQ, regardless of how they got there, so it gets past the problematic metrics like NumberOfMessagesSent or NumberOfMessagesReceived.
UPDATE: I just realised that this is the exact solution that Lucasz mentioned above, so consider this a confirmation that it works :)
A working Terraform example of the RATE(m1+m2) approach mentioned above:
resource "aws_cloudwatch_metric_alarm" "dlq_alarm" {
alarm_name = "alarm_name"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "1"
threshold = "0"
alarm_description = "desc"
insufficient_data_actions = []
alarm_actions = [aws_sns_topic.sns.arn]
metric_query {
id = "e1"
expression = "RATE(m2+m1)"
label = "Error Rate"
return_data = "true"
}
metric_query {
id = "m1"
metric {
metric_name = "ApproximateNumberOfMessagesVisible"
namespace = "AWS/SQS"
period = "60"
stat = "Sum"
unit = "Count"
dimensions = {
QueueName = "${aws_sqs_queue.sqs-dlq.name}"
}
}
}
metric_query {
id = "m2"
metric {
metric_name = "ApproximateNumberOfMessagesNotVisible"
namespace = "AWS/SQS"
period = "60"
stat = "Sum"
unit = "Count"
dimensions = {
QueueName = "${aws_sqs_queue.sqs-dlq.name}"
}
}
}
}