The documentation for the various client/target/ELB reset count metrics (TCP_Client_Reset_Count, TCP_Target_Reset_Count, TCP_ELB_Reset_Count) just says they count RST packets. I tried to understand what an RST packet is, and it seems to have to do with broken TCP connections. My load balancer has a single, long-term, seemingly successful client connection. Why do I see on the order of 100 client resets per hour? I also see about 10 load balancer resets per hour, and 0 target resets.
EDIT: I just observed that increasing the size of the server instance (I'm using Fargate and went from 0.25 vCPU to 0.5) led to a 10-fold reduction in client resets per hour. The number of load balancer resets did not change.
The total number of reset (RST) packets sent from a target to a client. These resets are generated by the target and forwarded by the load balancer.
All the load balancer nodes of an ELB register their IP addresses with the DNS service on Amazon's side, so different queries may return different IP addresses. This is why an ELB only has a DNS name instead of a static IP address.
Elastic Load Balancing sets the idle timeout value for UDP flows to 120 seconds.
My hunch is that this is related to a bug in the Network Load Balancer that causes it to send 100x as many health checks as it should. See: "NLB Target Group health checks are out of control". My theory is that a bug causes the health check connection to be broken in an unclean way if the target instance is not quick enough. These broken health check connections get reported as "client resets" even though they should be reported as "ELB resets" or not reported at all.
There are many reasons for a TCP RST to be sent. Some are abnormal, meaning errors, and some are normal connection cleanups that the TCP/IP stack or application performs.
An example of a normal TCP RST would be a long-lived connection that exceeds some time limit imposed by one side or the other. Once the time limit is exceeded, the connection can be forcibly closed, which generates the RST.
An example of an abnormal TCP RST would be an application that abruptly disconnected due to an internal error.
A poorly written application can also cause a TCP RST when it does not perform a graceful shutdown on the TCP socket before closing the connection.
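To make the difference between a graceful and an abrupt close concrete, here is a minimal sketch in Python on the loopback interface. The function name `close_and_observe` is made up for illustration, and the `struct.pack("ii", ...)` layout for `SO_LINGER` is an assumption that holds on Linux and macOS but not on Windows. Setting `SO_LINGER` with a zero timeout is one common way an application ends up sending an RST instead of a FIN; it is not necessarily what is happening behind your load balancer.

```python
import socket
import struct


def close_and_observe(abortive: bool) -> str:
    """Close the server side of a loopback connection and report what the client sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=5)
    server_side, _ = listener.accept()
    try:
        if abortive:
            # l_onoff=1, l_linger=0: close() discards the connection and sends RST.
            # Assumption: two C ints, which is the layout on Linux and macOS.
            server_side.setsockopt(
                socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0)
            )
        else:
            # Graceful path: announce end of data with a FIN before closing.
            server_side.shutdown(socket.SHUT_WR)
        server_side.close()
        try:
            data = client.recv(1)
        except ConnectionResetError:
            return "reset"  # the peer sent RST
        return "eof" if data == b"" else "data"  # empty read means a clean FIN
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print("abortive close ->", close_and_observe(True))
    print("graceful close ->", close_and_observe(False))
```

In this sketch the abortive close surfaces on the client as a connection reset, while the graceful close surfaces as an ordinary end of stream.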
I will guess that the behavior you are seeing is not a problem. However, to really know, you will need to do a wire trace and protocol analysis on each connection to determine exactly what is happening.
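If you do capture a trace, the RST flag is the 0x04 bit in the flags byte of the TCP header (byte offset 13). The following is a rough sketch of that check on a raw TCP header. The helper names `is_rst` and `build_header` are hypothetical, and the sketch assumes you have already stripped the Ethernet and IP headers. In practice a tool such as Wireshark or tcpdump does this decoding for you.

```python
import struct

TCP_FLAG_RST = 0x04  # RST bit within the TCP flags byte


def is_rst(tcp_header: bytes) -> bool:
    """Return True if the raw TCP header has the RST flag set."""
    if len(tcp_header) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    return bool(tcp_header[13] & TCP_FLAG_RST)


def build_header(src_port: int, dst_port: int, flags: int) -> bytes:
    """Build a minimal 20-byte TCP header for illustration (checksum left as 0)."""
    data_offset = 5 << 4  # header length of 5 32-bit words, stored in the high nibble
    return struct.pack(
        "!HHIIBBHHH",
        src_port, dst_port,
        0, 0,                 # sequence and acknowledgment numbers
        data_offset, flags,
        65535, 0, 0,          # window, checksum, urgent pointer
    )


if __name__ == "__main__":
    rst_ack = build_header(443, 50000, 0x14)   # RST + ACK
    plain_ack = build_header(443, 50000, 0x10)  # ACK only
    print(is_rst(rst_ack), is_rst(plain_ack))
```

Counting which side of each connection sends the segments that pass this check is what lets you tell client resets from load balancer or target resets.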