I have configured an alert to send an email when the sum of Cloud Function executions that finished with a status other than 'error' or 'ok' is above 0 (grouped by function name).
The way I defined the alert is: [screenshot of the alert condition]
And the secondary aggregator is delta.
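Roughly, that condition corresponds to something like the following when expressed through the Monitoring API with the Python client. I'm sketching it here rather than pasting the exact screenshot; the metric and label names are the standard ones for the built-in Cloud Functions execution count metric, and the alignment period and display names are just illustrative:

```python
# Sketch of the alert condition via the Monitoring API (google-cloud-monitoring).
# Metric/label names assume the built-in Cloud Functions execution_count metric;
# the alignment period, threshold and display names are illustrative.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

project_id = "my-project"  # placeholder

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Executions finished with unexpected status",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        # Only count executions whose status is neither 'ok' nor 'error'.
        filter=(
            'metric.type="cloudfunctions.googleapis.com/function/execution_count" '
            'AND resource.type="cloud_function" '
            'AND metric.label.status != "ok" '
            'AND metric.label.status != "error"'
        ),
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=300),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_DELTA,
                cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
                group_by_fields=["resource.label.function_name"],
            )
        ],
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0,
        duration=duration_pb2.Duration(seconds=0),
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Cloud Function abnormal terminations",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
)

client = monitoring_v3.AlertPolicyServiceClient()
client.create_alert_policy(name=f"projects/{project_id}", alert_policy=policy)
```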
The problem is that once the alert is open, the filters don't seem to matter any more: the incident stays open because the policy sees that the cloud function is triggered and finishes with any status (even an 'ok' status keeps it open, as long as it's triggered often enough).
At the moment the only solution I can think of is to define a log-based metric that does the counting itself, and then base the alert on that custom metric instead of on the built-in one.
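A minimal sketch of that workaround, using the Cloud Logging Python client; the log filter is only my guess at the Cloud Functions runtime log line ("Function execution took ... finished with status ..."), so it would need to be adjusted to whatever actually shows up in Logs Explorer:

```python
# Sketch: create a log-based counter metric for Cloud Function executions that
# finished with a status other than 'ok' or 'error'. The textPayload pattern is
# an assumption about the runtime log line and must be adjusted to the real logs.
from google.cloud import logging

client = logging.Client(project="my-project")  # placeholder project

log_filter = """
resource.type="cloud_function"
textPayload:"finished with status"
NOT textPayload:"finished with status: 'ok'"
NOT textPayload:"finished with status: 'error'"
""".strip()

metric = client.metric(
    "function_abnormal_terminations",  # name of the user-defined metric
    filter_=log_filter,
    description="Cloud Function executions that finished with an unexpected status",
)

if not metric.exists():
    metric.create()
```

Once created, the alert would point at `logging.googleapis.com/user/function_abnormal_terminations` instead of the built-in execution count metric.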
Is there something that I'm missing?
Edit:
Adding another image to show what I think might be the problem:
From the image above we can see that the graph won't go down to 0 but stays at 1, which is not how other, normal incidents behave.
According to the official documentation:
"Monitoring automatically closes an incident when it observes that the condition is no longer met or when 7 days have passed without an observation that the condition is still being met."
That made me think that there are cases where the condition no longer being met is not enough to close the incident, which is confirmed here:
"If measurements are missing (for example, if there are no HTTP requests for a couple of minutes), the policy uses the last recorded value to evaluate conditions."
The lack of HTTP requests isn't a reason to close the incident, since the policy keeps using the last recorded value (the one that triggered the alert).
So, using alerts on HTTP requests is fine, but you need to close the incidents yourself. Although I think it would be better to use a custom metric instead if you want them to be closed automatically, for example along the lines of the sketch below.
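A minimal sketch of that idea, assuming some periodic job does the counting itself and writes the result to a custom metric, including explicit zeros: because a 0 actually gets recorded when nothing went wrong, the condition sees a value below the threshold and the incident can auto-close. The metric name, labels and resource below are made up for illustration:

```python
# Sketch: periodically write a custom metric value, including explicit zeros,
# so the alert condition always has fresh points to evaluate and can auto-close.
# Metric name, labels and the counting logic are illustrative placeholders.
import time

from google.cloud import monitoring_v3

project_id = "my-project"        # placeholder
function_name = "my-function"    # placeholder
bad_count = 0                    # result of your own counting logic

client = monitoring_v3.MetricServiceClient()

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/cloudfunctions/abnormal_terminations"
series.metric.labels["function_name"] = function_name
series.resource.type = "global"
series.resource.labels["project_id"] = project_id

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now - int(now)) * 10**9)}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": bad_count}})
series.points = [point]

client.create_time_series(name=f"projects/{project_id}", time_series=[series])
```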