
How to debug and fix intermittent SSL 'connection reset by peer' error?

We are having an occasional (1 in 100) error appear on our client (CentOS) when connecting to a server (Windows/IIS) over HTTPS.

The error is: SSL: Connection reset by peer.

Running openssl s_client -connect example.com:443 -prexit works 99% of the time but sometimes returns write:errno=104 confirming the connection reset issue.
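To put a number on the failure rate, the same check could be scripted rather than run by hand. The following is only a minimal sketch, assuming Python 3 is available on the client; `probe`, `run_probes`, and `summarize` are hypothetical helper names, and `example.com` stands in for the real server.

```python
import socket
import ssl


def probe(host, port=443, timeout=10.0):
    """Attempt one TLS handshake; return None on success, else the error text."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return None
    except OSError as exc:
        # ConnectionResetError, timeouts, and ssl.SSLError are all OSError subclasses.
        return f"{type(exc).__name__}: {exc}"


def run_probes(host, attempts, probe_fn=probe):
    """Run probe_fn against host the given number of times and collect the results."""
    return [probe_fn(host) for _ in range(attempts)]


def summarize(results):
    """Count failures by error text in a list of probe results."""
    failures = {}
    for result in results:
        if result is not None:
            failures[result] = failures.get(result, 0) + 1
    return failures
```

For example, `summarize(run_probes("example.com", 200))` returns a dictionary mapping each distinct error message to the number of times it occurred, which makes it easier to compare failure rates between clients or network paths.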

Interestingly, the handshake is a different (smaller) size when the connection is reset, but I cannot work out how to inspect the handshake itself.

A successful connection is: SSL handshake has read 5308 bytes and written 319 bytes

A failed connection is: SSL handshake has read 5249 bytes and written 198 bytes

The same protocol (TLS) and cipher is used at all times.

Server side, the error in Windows Event log is: A fatal alert was generated and sent to the remote endpoint. This may result in termination of the connection. The TLS protocol defined fatal error code is 20. The Windows SChannel error state is 960.

Fatal alert code 20 is bad_record_mac: "Received a record with an incorrect MAC. This message is always fatal."

Can anyone help debug this further? As it's only an occasional issue I am struggling to think why it would happen. Thanks!

CJD asked Nov 09 '22 10:11


1 Answer

This is most likely not an application error but a low-level fault in the network infrastructure. It is not specific to SSL; it can affect any connection-oriented socket. Possible causes include a packet TTL expiring, a network route changing, and many others. Well-written socket code will always retry a few times before failing. This kind of problem is very hard to debug because it is often not repeatable over short time periods.
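As an illustration of the retry advice, here is a minimal sketch in Python. It is not taken from any particular client; `retry_on_reset` and `tls_handshake` are hypothetical names, and the attempt count and delays are arbitrary assumptions that would need tuning.

```python
import socket
import ssl
import time


def retry_on_reset(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); if the peer resets the connection, retry with backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionResetError:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


def tls_handshake(host, port=443, timeout=10.0):
    """Open a TLS connection to host and return the negotiated protocol version."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version()
```

A call such as `retry_on_reset(lambda: tls_handshake("example.com"))` would then tolerate an occasional reset. Retrying hides the symptom rather than fixing the cause, so it is worth logging each retry to keep the underlying failure rate visible.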

Many years ago this error was driving me crazy. I did everything I could to track it down, and even wrote a monitor to walk the network graph of the system to make sure each node of the graph was functional and responding properly. About a year later the problem disappeared when a switch on the subnet was replaced. The switch was close to the application, not to the nodes on the graph in the datacenter.

george answered Nov 15 '22 09:11