
Strange CloudWatch alarm behaviour

I have a backup script that runs every 2 hours. I want to use CloudWatch to track the successful executions of this script and CloudWatch's Alarms to get notified whenever the script runs into problems.

The script puts a data point on a CloudWatch metric after every successful backup:

    mon-put-data --namespace Backup --metric-name $metric --unit Count --value 1
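For context, here is a minimal sketch of how such a wrapper might look; the backup command and metric name are placeholders, not the actual script:

    #!/bin/bash
    # Hypothetical wrapper: publish the success datapoint only if the
    # backup command exits cleanly.
    metric="backup-svn"                        # placeholder metric name
    if /usr/local/bin/run-svn-backup.sh; then  # placeholder backup command
        mon-put-data --namespace Backup --metric-name "$metric" --unit Count --value 1
    fi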

I have an alarm that goes to ALARM state whenever the statistic "Sum" on the metric is less than 2 in a 6-hour period.
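For reference, an equivalent alarm could be created roughly like this with the modern AWS CLI (the metric name and SNS topic ARN are placeholders; Sum < 2 over a single 6-hour period):

    aws cloudwatch put-metric-alarm \
        --alarm-name alarm-backup-svn \
        --namespace Backup \
        --metric-name backup-svn \
        --statistic Sum \
        --period 21600 \
        --evaluation-periods 1 \
        --threshold 2 \
        --comparison-operator LessThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:backup-alerts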

In order to test this setup, after a day I stopped putting data in the metric (i.e., I commented out the mon-put-data command). Good: eventually the alarm went to the ALARM state and I got an email notification, as expected.

The problem is that, some time later, the alarm went back to the OK state, even though no new data was being added to the metric!

The two transitions (OK => ALARM, then ALARM => OK) were logged, and I reproduce the logs below. Note that, although both show "period: 21600" (i.e., 6 hours), the second one shows a 12-hour time span between startDate and queryDate; I see that this might explain the transition, but I cannot understand why CloudWatch would consider a 12-hour time span to calculate a statistic with a 6-hour period!

What am I missing here? How do I configure the alarms to achieve what I want (i.e., get notified if backups are not being made)?

{
    "Timestamp": "2013-03-06T15:12:01.069Z",
    "HistoryItemType": "StateUpdate",
    "AlarmName": "alarm-backup-svn",
    "HistoryData": {
        "version": "1.0",
        "oldState": {
            "stateValue": "OK",
            "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (3.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-05T21:12:44.081+0000",
                "startDate": "2013-03-05T15:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    3
                ],
                "threshold": 3
            }
        },
        "newState": {
            "stateValue": "ALARM",
            "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T15:12:01.052+0000",
                "startDate": "2013-03-06T09:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    1
                ],
                "threshold": 2
            }
        }
    },
    "HistorySummary": "Alarm updated from OK to ALARM"
}

The second one, which I simply cannot understand:

{
    "Timestamp": "2013-03-06T17:46:01.063Z",
    "HistoryItemType": "StateUpdate",
    "AlarmName": "alarm-backup-svn",
    "HistoryData": {
        "version": "1.0",
        "oldState": {
            "stateValue": "ALARM",
            "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T15:12:01.052+0000",
                "startDate": "2013-03-06T09:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    1
                ],
                "threshold": 2
            }
        },
        "newState": {
            "stateValue": "OK",
            "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T17:46:01.041+0000",
                "startDate": "2013-03-06T05:46:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    3
                ],
                "threshold": 2
            }
        }
    },
    "HistorySummary": "Alarm updated from ALARM to OK"
}
Asked by Bruno Reis on Mar 06 '13.



1 Answer

This behavior (your alarm not transitioning into the INSUFFICIENT_DATA state) happens because CloudWatch considers 'pre-timestamped' metric datapoints: for a 6-hour alarm, if no data exists in the current open 6-hour window, it will take data from the previous 6-hour window (hence the 12-hour time span you see above).

To increase the 'fidelity' of your alarm, reduce the alarm period to 1 hour (3600 s) and increase the number of evaluation periods to however many consecutive periods of failure you want to alarm on. That will ensure your alarm transitions into INSUFFICIENT_DATA as you expect.

How do I configure the alarms to achieve what I want (i.e., get notified if backups are not being made)?

A possible architecture for your alarm would be to publish 1 if your job is successful and 0 if it failed. Then create an alarm with a threshold of < 1 over three 3600 s periods, meaning your alarm will go into ALARM if the job is failing (i.e. running, but failing). If you also set an INSUFFICIENT_DATA action on that alarm, you will additionally get notified if your job is not running at all.
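A minimal sketch of that setup, assuming the modern AWS CLI and placeholder metric name, backup command and SNS topic ARN (the legacy mon-* tools take equivalent options):

    # Backup wrapper: publish 1 on success, 0 on failure.
    if /usr/local/bin/run-svn-backup.sh; then
        aws cloudwatch put-metric-data --namespace Backup --metric-name backup-svn --unit Count --value 1
    else
        aws cloudwatch put-metric-data --namespace Backup --metric-name backup-svn --unit Count --value 0
    fi

    # Alarm: Sum < 1 over three consecutive 1-hour periods raises ALARM
    # (job running but failing); the insufficient-data action notifies
    # when no data arrives at all (job not running).
    aws cloudwatch put-metric-alarm \
        --alarm-name alarm-backup-svn \
        --namespace Backup \
        --metric-name backup-svn \
        --statistic Sum \
        --period 3600 \
        --evaluation-periods 3 \
        --threshold 1 \
        --comparison-operator LessThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:backup-alerts \
        --insufficient-data-actions arn:aws:sns:us-east-1:123456789012:backup-alerts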

Hope that makes sense.

Answered by Wal.