I am aiming to have a CloudWatch alarm triggered when a message from an SQS queue to a Lambda function exceeds the maximum number of retries.
I presumed this would be easy and that the NumberOfMessagesReceived metric would reflect it; those familiar with SQS will know that it does not.
My quick and easy solution for this problem was to introduce a "Limbo" queue, which acts as the first DLQ and within seconds pushes the message to the final/actual DLQ. In the metrics this results in a spike in the "Limbo" queue's visible messages metric, so an alarm threshold of "> 0" means that every time that queue receives a message an alert can be issued.
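For illustration, a minimal CloudFormation sketch of that alarm on the Limbo queue could look like the following (the alarm and queue names here are placeholders, not the real ones from my stack):

LimboAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: "LIMBO_ALARM"          # placeholder name
    Namespace: AWS/SQS
    MetricName: ApproximateNumberOfMessagesVisible
    Dimensions:
      - Name: QueueName
        Value: "Limbo_queue_name"     # placeholder for the Limbo queue's name
    Statistic: Maximum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 0                      # any visible message trips the alarm
    ComparisonOperator: GreaterThanThreshold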
However, the powers above me are not happy with having a "Limbo" queue every time we want this functionality.
As far as I have been able to figure out, there are some alternative methods, but these seem worse than the Limbo solution.
The first is to have a new Lambda function that uses the SQS DLQ as an event source and generates the alert (a rough sketch of this wiring follows the alternatives below).
The second is to have logic inside the existing Lambdas (that process SQS messages) read the number of times a message has been retried and, on the final attempt, generate the alert. This rather defeats the point of using a queue and a redrive policy in the first place, and is an over-engineered solution.
The last alternative I can think of is to use some Metric Math to look at the DLQ and calculate whether there has been an increase in the last X minutes.
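For the first alternative, I imagine the wiring would be something like the sketch below, where the DLQ is the Lambda's event source and the function does nothing but publish the alert; DeadletterQueue and DlqNotifierFunction are placeholder logical IDs of mine, not resources we actually have:

DlqNotifierEventSource:
  Type: AWS::Lambda::EventSourceMapping
  Properties:
    # Placeholder logical IDs: the DLQ feeds a small notifier Lambda that
    # would publish the alert (e.g. to an SNS topic).
    EventSourceArn:
      Fn::GetAtt:
        - "DeadletterQueue"
        - "Arn"
    FunctionName:
      Ref: "DlqNotifierFunction"
    BatchSize: 1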
These all seem like strange and overly complex solutions to something that (I am convinced) must have a simple implementation. How do I create an alert every time a DLQ receives a message?
How can I monitor and log dead-letter queues? You can use Amazon CloudWatch metrics to monitor dead-letter queues associated with your Amazon SNS subscriptions. All Amazon SQS queues emit CloudWatch metrics at one-minute intervals.
Dead-letter queues are also used at the sending end of a channel, for data-conversion errors. Every queue manager in a network typically has a local queue to be used as a dead-letter queue, so that messages that cannot be delivered to their correct destination can be stored for later retrieval.
When the ReceiveCount for a message exceeds the maxReceiveCount for a queue, Amazon SQS moves the message to a dead-letter queue (with its original message ID).
The purpose of the dead-letter queue is to hold messages that can't be delivered to any receiver, or messages that couldn't be processed. Messages can then be removed from the DLQ and inspected.
A dead-letter queue is an Amazon SQS queue that an Amazon SNS subscription can target for messages that can't be delivered to subscribers successfully. Messages that can't be delivered due to client errors or server errors are held in the dead-letter queue for further analysis or reprocessing.
For a FIFO topic, use an Amazon SQS FIFO queue as a dead-letter queue for the Amazon SNS subscription. To use an encrypted Amazon SQS queue as a dead-letter queue, you must use a custom CMK with a key policy that grants the Amazon SNS service principal access to AWS KMS API actions.
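For context, the move described above is driven by a redrive policy configured on the source queue. A minimal CloudFormation sketch (queue names and the maxReceiveCount value are placeholders) could look like:

SourceQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: SourceQueue            # placeholder name
    RedrivePolicy:
      deadLetterTargetArn:
        Fn::GetAtt:
          - "DeadletterQueue"
          - "Arn"
      maxReceiveCount: 3              # placeholder; after 3 failed receives the message moves to the DLQ
DeadletterQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: DeadletterQueue        # placeholder name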
I came across this same issue and had success implementing it using Metric Math. CloudWatch has a RATE() function which:
"Returns the rate of change of the metric per second. This is calculated as the difference between the latest data point value and the previous data point value, divided by the time difference in seconds between the two values."
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
So I created an alarm which looks at the rate of change of the ApproximateNumberOfMessagesVisible metric on the dead-letter queue. It goes into alarm when the rate of change is greater than 0. Here is a CloudFormation template example for the alarm:
DeadletterAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: "DEADLETTER_ALARM"
    ComparisonOperator: GreaterThanThreshold
    EvaluationPeriods: 1
    TreatMissingData: missing
    Threshold: '0'
    Metrics:
      - Id: r1
        Expression: RATE(FILL(m1, 0))
        ReturnData: true
      - Id: m1
        Label: VisibleAverage
        ReturnData: false
        MetricStat:
          Stat: Average
          Period: '300'
          Metric:
            MetricName: ApproximateNumberOfMessagesVisible
            Namespace: AWS/SQS
            Dimensions:
              - Name: QueueName
                Value: "Deadletter_queue_name"
One other way to accomplish this is to alarm on ApproximateNumberOfMessagesDelayed. Then you just need to set a delay on your DLQ. So it could look something like this:
MyDLQAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: MyDLQAlarm
    AlarmDescription: "Alarm when we have 1 or more failed messages in 10 minutes for MyQueue."
    Namespace: "AWS/SQS"
    MetricName: "ApproximateNumberOfMessagesDelayed"
    Dimensions:
      - Name: "QueueName"
        Value:
          # Watch the DLQ's metric, since the delay is applied on the DLQ
          Fn::GetAtt:
            - "MyQueueDLQ"
            - "QueueName"
    Statistic: "Sum"
    Period: 300
    DatapointsToAlarm: 1
    EvaluationPeriods: 2
    Threshold: 1
    ComparisonOperator: "GreaterThanOrEqualToThreshold"
    AlarmActions:
      - Ref: "SNSTopic"
Then your DLQ can look like:
MyQueueDLQ:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: MyQueueDLQ
    MessageRetentionPeriod: 1209600
    DelaySeconds: 60