
Strange CloudWatch alarm behaviour

I have a backup script that runs every 2 hours. I want to use CloudWatch to track the successful executions of this script and CloudWatch's Alarms to get notified whenever the script runs into problems.

The script puts a data point on a CloudWatch metric after every successful backup:

    mon-put-data --namespace Backup --metric-name $metric --unit Count --value 1
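For context, here is a minimal sketch of how such a wrapper might look; the backup command and metric name are placeholders, not the actual script:

    #!/bin/bash
    # Hypothetical wrapper: publish the success datapoint only if the
    # backup command exits cleanly.
    metric="backup-svn"                        # placeholder metric name
    if /usr/local/bin/run-svn-backup.sh; then  # placeholder backup command
        mon-put-data --namespace Backup --metric-name "$metric" --unit Count --value 1
    fi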

I have an alarm that goes to ALARM state whenever the statistic "Sum" on the metric is less than 2 in a 6-hour period.
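For reference, an equivalent alarm could be created roughly like this with the modern AWS CLI (the metric name and SNS topic ARN are placeholders; Sum < 2 over a single 6-hour period):

    aws cloudwatch put-metric-alarm \
        --alarm-name alarm-backup-svn \
        --namespace Backup \
        --metric-name backup-svn \
        --statistic Sum \
        --period 21600 \
        --evaluation-periods 1 \
        --threshold 2 \
        --comparison-operator LessThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:backup-alerts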

In order to test this setup, after a day I stopped putting data in the metric (i.e., I commented out the mon-put-data command). Good: eventually the alarm went to the ALARM state and I got an email notification, as expected.

The problem is that, some time later, the alarm went back to the OK state, even though no new data was being added to the metric!

The two transitions (OK => ALARM, then ALARM => OK) were logged, and I reproduce the logs below. Note that, although both show "period: 21600" (i.e., 6 hours), the second one shows a 12-hour time span between startDate and queryDate; I see that this might explain the transition, but I cannot understand why CloudWatch would consider a 12-hour time span to calculate a statistic with a 6-hour period!

What am I missing here? How do I configure the alarms to achieve what I want (i.e., get notified if backups are not being made)?

{
    "Timestamp": "2013-03-06T15:12:01.069Z",
    "HistoryItemType": "StateUpdate",
    "AlarmName": "alarm-backup-svn",
    "HistoryData": {
        "version": "1.0",
        "oldState": {
            "stateValue": "OK",
            "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (3.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-05T21:12:44.081+0000",
                "startDate": "2013-03-05T15:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    3
                ],
                "threshold": 3
            }
        },
        "newState": {
            "stateValue": "ALARM",
            "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T15:12:01.052+0000",
                "startDate": "2013-03-06T09:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    1
                ],
                "threshold": 2
            }
        }
    },
    "HistorySummary": "Alarm updated from OK to ALARM"
}

The second one, which I simply cannot understand:

{
    "Timestamp": "2013-03-06T17:46:01.063Z",
    "HistoryItemType": "StateUpdate",
    "AlarmName": "alarm-backup-svn",
    "HistoryData": {
        "version": "1.0",
        "oldState": {
            "stateValue": "ALARM",
            "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T15:12:01.052+0000",
                "startDate": "2013-03-06T09:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    1
                ],
                "threshold": 2
            }
        },
        "newState": {
            "stateValue": "OK",
            "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T17:46:01.041+0000",
                "startDate": "2013-03-06T05:46:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    3
                ],
                "threshold": 2
            }
        }
    },
    "HistorySummary": "Alarm updated from ALARM to OK"
}
Asked by Bruno Reis on Mar 06 '13.



1 Answer

This behavior (your alarm not transitioning into the INSUFFICIENT_DATA state) happens because CloudWatch considers 'pre-timestamped' metric datapoints: for a 6-hour alarm, if no data exists in the current open 6-hour window, it will take data from the previous 6-hour window (hence the 12-hour time span you see above).

To increase the 'fidelity' of your alarm, reduce the alarm period to 1 hour (3600 s) and increase the number of evaluation periods to however many consecutive periods of failure you want to alarm on. That will ensure your alarm transitions into INSUFFICIENT_DATA as you expect.

How do I configure the alarms to achieve what I want (i.e., get notified if backups are not being made)?

A possible architecture for your alarm would be to publish 1 if your job is successful and 0 if it failed. Then create an alarm with a threshold of < 1 over three 3600 s periods, meaning your alarm will go into ALARM if the job is failing (i.e. running, but failing). If you also set an INSUFFICIENT_DATA action on that alarm, you will additionally get notified if your job is not running at all.
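A minimal sketch of that setup, assuming the modern AWS CLI and placeholder metric name, backup command and SNS topic ARN (the legacy mon-* tools take equivalent options):

    # Backup wrapper: publish 1 on success, 0 on failure.
    if /usr/local/bin/run-svn-backup.sh; then
        aws cloudwatch put-metric-data --namespace Backup --metric-name backup-svn --unit Count --value 1
    else
        aws cloudwatch put-metric-data --namespace Backup --metric-name backup-svn --unit Count --value 0
    fi

    # Alarm: Sum < 1 over three consecutive 1-hour periods raises ALARM
    # (job running but failing); the insufficient-data action notifies
    # when no data arrives at all (job not running).
    aws cloudwatch put-metric-alarm \
        --alarm-name alarm-backup-svn \
        --namespace Backup \
        --metric-name backup-svn \
        --statistic Sum \
        --period 3600 \
        --evaluation-periods 3 \
        --threshold 1 \
        --comparison-operator LessThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:backup-alerts \
        --insufficient-data-actions arn:aws:sns:us-east-1:123456789012:backup-alerts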

Hope that makes sense.

Answered by Wal.