We have three EC2 instances, one in each availability zone (AZ) of the eu-west-1 region. They are load-balanced using ELB. We'd like to use CloudWatch to monitor how many instances are registered with the load balancer. The problem is: I don't really understand the HealthyHostCount metric.
For a deployment, we'd like to be able to de-register a single instance (take it out of the LB) without being notified. So the alarm would be: Notify if there is only 1 healthy instance left behind the loadbalancer for 5 minutes.
As far as I understand, HealthyHostCount (HHC) is the number of healthy instances that are registered with a given ELB, averaged over all AZs. If everything is okay, the HHC should be 1 (no matter over what period of time) because there is 1 instance in each AZ.
A couple of days ago, someone deployed without re-registering the instances, so only 1 instance was being balanced. When we noticed that, we created an alarm meant to notify us when the average HHC stayed below 0.6 for 5 minutes. (If only 1 instance is registered with the ELB, the HHC should average 0.33 over any period of time.) However, the alarm never changed to state "ALARM."
When I checked the HHC in CloudWatch, the values didn't make sense (a sum of 10.0 for a 5-minute interval is all I remember now).
It's all a big mess to me. Any time I think I understand the metric, the CloudWatch charts are all gibberish to me.
Could someone please explain how to use HHC to get an alarm when only 1 instance is registered? Is average HHC the way to go or should I use another metric?
The HealthyHostCount metric records one data point per availability zone, holding the count of healthy hosts in that zone, each time a health check is executed. Your ELB health check has an Interval parameter (in seconds) that determines how many health checks are executed per minute.
If you are watching a per-AZ metric with a health check Interval of 10 seconds and 2 healthy hosts in that AZ, you will see 6 data points per minute (60/10), each with a value of 2. The average, max and min will be 2, but the sum will be 6*2=12.
If you have 3 AZs with 2 hosts each, again with an Interval of 10 seconds, but you are looking at the per-LB metric, you will see 3*6=18 data points per minute, each with a value of 2. The average, max and min will be 2, but the sum will be 18*2=36.
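The arithmetic above can be checked with a small sketch. It only models the behaviour as described in this answer (one data point per AZ per health check); the function name expected_stats and its parameters are made up for illustration and are not part of any AWS API.

```python
def expected_stats(interval_s, healthy_per_az, period_s=60):
    """Model the HealthyHostCount data points described in this answer.

    healthy_per_az holds the healthy host count of every AZ in the view:
    one entry for a per-AZ view, several entries for a per-LB view.
    Assumption: each AZ emits exactly one data point per health check.
    """
    checks_per_period = period_s // interval_s
    samples = [
        count
        for count in healthy_per_az
        for _ in range(checks_per_period)
    ]
    return {
        "samples": len(samples),
        "sum": sum(samples),
        "average": sum(samples) / len(samples),
        "min": min(samples),
        "max": max(samples),
    }


per_az = expected_stats(10, [2])        # one AZ with 2 healthy hosts
per_lb = expected_stats(10, [2, 2, 2])  # three AZs with 2 healthy hosts each
print(per_az)
print(per_lb)
```

The average, min and max stay at 2 in both views; only the sample count and the sum change with the interval and the number of AZs.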
I recommend choosing an interval that divides 60 seconds evenly (5, 6, 10, 15, 20, 30 or 60 seconds).
In your case, if your interval is 30 seconds and you have 3 AZs with 1 server per AZ, you should expect 2 data points per AZ per minute. So set up a per-LB alarm with a Period of 1 minute on the Sum of HealthyHostCount that triggers when the value is less than or equal to 2 (2 data points * 1 healthy AZ * 1 healthy server = 2; the other 4 data points from the unhealthy AZs should be 0, so they won't affect the sum).
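As a rough sketch, the alarm described above could be expressed as the keyword arguments for boto3's put_metric_alarm call. The helper name sum_alarm_params, the alarm name and the load balancer name "my-load-balancer" are placeholders I made up; the threshold follows the formula in this answer. The block only builds the dictionary, so it runs without AWS credentials.

```python
def sum_alarm_params(load_balancer_name, interval_s,
                     healthy_hosts_to_alarm=1, period_s=60):
    """Build put_metric_alarm arguments for a Sum-based HealthyHostCount alarm.

    Assumption: one data point per AZ per health check, so a single healthy
    host contributes (period_s / interval_s) to the Sum in each period.
    """
    checks_per_period = period_s // interval_s
    threshold = checks_per_period * healthy_hosts_to_alarm
    return {
        "AlarmName": f"{load_balancer_name}-healthy-host-sum",  # placeholder
        "Namespace": "AWS/ELB",
        "MetricName": "HealthyHostCount",
        "Dimensions": [
            {"Name": "LoadBalancerName", "Value": load_balancer_name},
        ],
        "Statistic": "Sum",
        "Period": period_s,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "LessThanOrEqualToThreshold",
    }


params = sum_alarm_params("my-load-balancer", interval_s=30)
print(params["Threshold"])

# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

To wait 5 minutes before alarming, as the question asks, raising EvaluationPeriods to 5 while keeping the 1-minute Period would be one option.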
UPDATE:
It turns out that the number of health checks executed also depends on the number of internal instances that make up the ELB (usually one per AZ). If you get a traffic spike, or enough load to saturate a single ELB-internal instance, the number of internal servers inside the ELB will grow and you will unexpectedly get more data points. This can affect the sum value, but only if you have a lot of traffic; I didn't see this issue with a peak load of 6k RPM distributed across 3 AZs. If this is your scenario, then using the average is a safer bet, but I would recommend that you use LowerThan 0.65 as your threshold.
The link also makes me wonder how the Cross-Zone Load Balancing feature affects the number of data points...
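To illustrate the sum-versus-average point from the update, here is a small simulation. It assumes every ELB-internal node runs its own health checks and that the nodes scale evenly across AZs; both are simplifications of mine, and the function name stats is made up.

```python
def stats(healthy_per_az, checks_per_minute, elb_nodes_per_az):
    """Return (sum, average) of the modelled HealthyHostCount data points.

    Assumption: each ELB-internal node in an AZ emits one data point per
    health check, and every AZ has the same number of internal nodes.
    """
    samples = [
        count
        for count in healthy_per_az
        for _ in range(checks_per_minute * elb_nodes_per_az)
    ]
    return sum(samples), sum(samples) / len(samples)


# 30-second interval -> 2 checks per minute; 3 AZs, only one still healthy.
normal = stats([1, 0, 0], checks_per_minute=2, elb_nodes_per_az=1)
scaled = stats([1, 0, 0], checks_per_minute=2, elb_nodes_per_az=3)

# Two healthy AZs, e.g. one instance taken out for a deployment.
two_healthy = stats([1, 1, 0], checks_per_minute=2, elb_nodes_per_az=1)

print(normal, scaled, two_healthy)
```

In this model the sum triples when the ELB scales out, while the average stays at one third. A threshold of 0.65 sits between one healthy AZ (about 0.33) and two healthy AZs (about 0.67), so a single de-registered instance would not trip the alarm.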
This is an area where the CloudWatch web console doesn't expose everything that CloudWatch can do. As the docs explain, HealthyHostCount is a per-availability-zone metric. The console lets you view HealthyHostCount by availability zone (but across all load balancers) or by load balancer (but across all zones), but not sliced both ways.
If you only have one load balancer, the simplest thing would be to set up one alarm on each of the per-zone metrics. If you have multiple availability zones, then you should be able to use the API to create an alarm slicing across availability zone and load balancer (again, one alarm per load balancer), but you can't do this from the web UI as far as I know.
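A minimal sketch of what slicing both ways might look like through the API: one set of put_metric_alarm arguments per zone, each carrying both the LoadBalancerName and the AvailabilityZone dimension. The helper name, alarm names, load balancer name, statistic and threshold are illustrative assumptions of mine, not values from this thread. The block only builds the dictionaries, so it runs without AWS access.

```python
def per_zone_alarm_params(load_balancer_name, zones, min_healthy=1):
    """Build one alarm definition per (load balancer, availability zone) pair.

    Assumption: an alarm should fire when a zone reports fewer than
    min_healthy healthy hosts for 5 consecutive 1-minute periods.
    """
    return [
        {
            "AlarmName": f"{load_balancer_name}-{zone}-healthy-hosts",
            "Namespace": "AWS/ELB",
            "MetricName": "HealthyHostCount",
            "Dimensions": [
                {"Name": "LoadBalancerName", "Value": load_balancer_name},
                {"Name": "AvailabilityZone", "Value": zone},
            ],
            "Statistic": "Minimum",
            "Period": 60,
            "EvaluationPeriods": 5,
            "Threshold": float(min_healthy),
            "ComparisonOperator": "LessThanThreshold",
        }
        for zone in zones
    ]


alarms = per_zone_alarm_params(
    "my-load-balancer", ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
)
for alarm in alarms:
    print(alarm["AlarmName"])

# With boto3 installed and credentials configured, each entry could be sent as:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Note that per-zone alarms like these would also fire during a deployment that empties a single zone for long enough, so they may not match the "only notify when 1 instance is left" requirement on their own.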