
Why does AWS CloudWatch use an evaluation range when determining alarm state with missing data points?

From the docs:

No matter what value you set for how to treat missing data, when an alarm evaluates whether to change state, CloudWatch attempts to retrieve a higher number of data points than specified by Evaluation Periods. The exact number of data points it attempts to retrieve depends on the length of the alarm period and whether it is based on a metric with standard resolution or high resolution. The timeframe of the data points that it attempts to retrieve is the evaluation range.

The docs go on to give an example of an alarm with 'EvaluationPeriods' and 'DatapointsToAlarm' both set to 3. They state that CloudWatch retrieves the 5 most recent data points. The first part of my question is: where does the 5 come from? It's not clear from the docs.

The second part of my question is: why have this behavior at all (or at least, why have it by default)? If I set my Evaluation Periods to 3, my Datapoints to Alarm to 3, and tell CloudWatch to 'TreatMissingData' as 'breaching', I expect 3 periods of missing data to trigger an alarm state. This doesn't necessarily happen, as illustrated by an example in the docs.

user3658800 asked Nov 08 '18 21:11



2 Answers

I also agree that this behavior is unexpected, and the fact that you can't configure it is quite frustrating. However, there does seem to be an easy workaround depending on your use case.

I also wanted the same behavior as you specified; i.e. a missing data point is a breaching data point plain and simple:

If I set my evaluation period to 3, my Datapoints to Alarm to 3, and tell Cloudwatch to 'TreatMissingData' as 'breaching,' I'm going to expect 3 periods of missing data to trigger an alarm state.

My use case was essentially a push-style health monitor. We needed a particular on-premises service to report a "healthy" metric to CloudWatch daily, and an alarm to fire if that report didn't come through because of network issues or anything else disruptive. Semantically, missing data is the same as reporting a metric value of 0 (the "healthy" metric has value 1).

So I used metric math's FILL function to replace every missing data point with 0. A 1-out-of-1 alarm with a threshold of < 1 on this new expression provides exactly the needed behavior without involving any kind of "missing data" treatment.
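As a rough sketch of what that alarm definition might look like, the snippet below builds the parameters for CloudWatch's PutMetricAlarm API as a plain dictionary. The namespace `OnPrem/Health`, the metric name `Healthy`, the alarm name, and the helper `build_heartbeat_alarm` are all made up for illustration, and the daily period and `Maximum` statistic are assumptions about the use case rather than anything confirmed above.

```python
def build_heartbeat_alarm(namespace, metric_name, period_seconds=86400):
    """Return keyword arguments for CloudWatch's PutMetricAlarm API.

    All names passed in are hypothetical; adjust them to your own metric.
    """
    return {
        "AlarmName": f"{metric_name}-heartbeat-missing",  # illustrative name
        "ComparisonOperator": "LessThanThreshold",        # alarm when value < 1
        "Threshold": 1.0,
        "EvaluationPeriods": 1,                           # 1-out-of-1 alarm
        "DatapointsToAlarm": 1,
        "Metrics": [
            {
                # The raw "healthy" metric; not evaluated directly.
                "Id": "m1",
                "MetricStat": {
                    "Metric": {"Namespace": namespace, "MetricName": metric_name},
                    "Period": period_seconds,  # assumed: one report per day
                    "Stat": "Maximum",         # assumed statistic
                },
                "ReturnData": False,
            },
            {
                # Metric math expression: every missing data point becomes 0.
                "Id": "e1",
                "Expression": "FILL(m1, 0)",
                "Label": "healthy metric with gaps filled as 0",
                "ReturnData": True,  # the alarm evaluates this expression
            },
        ],
    }


params = build_heartbeat_alarm("OnPrem/Health", "Healthy")
print(params["Metrics"][1]["Expression"])
```

With boto3 installed and credentials configured, something like `boto3.client("cloudwatch").put_metric_alarm(**params)` should create the alarm; it is worth checking in a test account how FILL behaves for your metric before relying on it.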

Michael answered Oct 14 '22 07:10


I had the same questions. As best I can tell, the 5 can be explained by the difference between the standard collection interval and standard resolution, if I am thinking about them correctly. If we assume a standard collection interval of 5 minutes and a standard resolution of 1 minute, then 5 separate data points are collected within each collection interval. The example needs 3 data points over 3 evaluation periods, which is fewer than the 5 data points CloudWatch has collected, so CloudWatch has all the data points it needs within the 5-data-point evaluation range defined by a single collection.

For example, if 4 of the 5 expected data points are missing from the collection, you have one actual data point and need 2 more within the evaluation range to reach the 3 required for alarm evaluation. Those 2 (not the 4 that are actually missing from the collection) are what the documentation treats as the "missing" data points, which I find confusing. The tables in the AWS documentation show how the different treatments of those 2 "missing" data points affect the alarm evaluation.
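The arithmetic in that example can be sketched in a few lines of Python. This only illustrates the interpretation given in this answer (fill in just enough missing points to reach the number of Evaluation Periods); it is not CloudWatch's actual implementation, and the function name `datapoints_to_fill` is invented for the sketch.

```python
def datapoints_to_fill(evaluation_range, evaluation_periods):
    """Count how many missing points would be filled in, under the
    interpretation above: only enough to reach `evaluation_periods`.

    `evaluation_range` is a list of retrieved values, with None marking
    a data point that is missing.
    """
    present = sum(1 for point in evaluation_range if point is not None)
    return max(0, evaluation_periods - present)


# 5-point evaluation range with 4 points missing and 1 real data point.
evaluation_range = [None, None, None, None, 0.0]
print(datapoints_to_fill(evaluation_range, 3))  # 2
```

So although 4 points are absent from the range, only 2 are treated according to the 'TreatMissingData' setting in this reading.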

Regardless of whether this is the correct interpretation, this could be better explained in the documentation.

Ulises Llull answered Oct 14 '22 07:10