
Haproxy + netty: Way to prevent exceptions on connection reset?

We're using haproxy in front of a netty-3.6-run backend. We are handling a huge number of connections, some of which can be longstanding.

The problem is that when haproxy closes a connection for the purpose of rebalancing, it does so by sending a TCP RST. When the sun.nio.ch class employed by netty sees this, it throws an IOException: "Connection reset by peer".

Trace:

sun.nio.ch.FileDispatcherImpl.read0(Native Method):1 in ""
sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39):1 in ""
sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:225):1 in ""
sun.nio.ch.IOUtil.read(IOUtil.java:193):1 in ""
sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:375):1 in ""
org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64):1 in ""
org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109):1 in ""
org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312):1 in ""
org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90):1 in ""
org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178):1 in ""
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145):1 in ""
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615):1 in ""
java.lang.Thread.run(Thread.java:724):1 in ""

This causes the following problems per configuration:

option http-pretend-keepalive

This is what works best (as haproxy seems to close most connections with a FIN rather than an RST), but it still produces about 3 exceptions per server per second. It also effectively neuters load balancing, because some incoming connections are very long-lived with very high throughput: with pretend-keepalive, haproxy never rebalances them to another server.

option http-keep-alive

Since our backend expects keep-alive connections to really be kept alive (and hence does not close them on its own), this setting amounts to every connection eventually netting one exception, which in turn crashes our servers. We tried adding prefer-last-server, but it doesn't help much.

option http-server-close

This should theoretically give us both proper load balancing and no exceptions. However, it seems that after our backend servers respond, there is a race as to which side closes the connection first: haproxy with its RST or our registered ChannelFutureListener.CLOSE. In practice, we still get too many exceptions and our servers crash.

Interestingly, the number of exceptions generally increases with the number of workers we give our channels. I guess more workers speed up reading more than writing.

Anyway, I've been reading up on the different channel and socket options in netty as well as haproxy for a while now and haven't found anything that sounded like a solution (or that worked when I tried it).

Benjaminssp asked Feb 04 '14 10:02


3 Answers

Note: As far as I understand, you don't have to worry about connection reset exceptions unless you use connection pooling with keep-alive connections on your end.

I faced a similar issue with lots of connection resets (RST) while using HAProxy for our services (5 to 20 times in a 10-second window, depending on load).
This is how I fixed it.

We had a system where connections are always kept alive (keep-alive is always true at the HTTP connection level, i.e. once a connection is established, we reuse it from the HTTP connection pool for subsequent calls instead of creating a new one).

From debugging the code and looking at a TCP dump, I found that HAProxy sent RSTs in the following scenarios:

  1. When HAProxy's timeout client or timeout server was reached on an idle connection.
    This was set to 60 seconds for us. Since we have a pool of connections, some of them go unused for a minute when the load on the server decreases.
    HAProxy then closed these connections with an RST.

  2. When HAProxy's option prefer-last-server was not set.
    As per the Docs:

The real use is for keep-alive connections sent to servers. When this option is used, haproxy will try to reuse the same connection that is attached to the server instead of rebalancing to another server, causing a close of the connection.

Since this was not set, every time a connection was reused from the pool, HAProxy closed it with an RST and created a new one to a different server (as our load balancer was set to round-robin). This rendered the entire connection pool useless.

So the configuration that worked fine:

  1. option prefer-last-server: Existing connections to a server are reused.
    Note: This will NOT make the load balancer prefer the previous server over another server for a new connection. The decision for new connections is always based on the load balancing algorithm. This option only affects an existing connection that is already alive between a client and a server.
    When I tested with this option, a new connection still went to server2 even though the connection before it had gone to server1.
  2. balance leastconn: With round-robin and keep-alive, connections can become skewed toward a single server. (Say there are just 2 servers and server2 goes down for a deployment: all new connections then go to server1. When server2 comes back up, round-robin still allocates new requests alternately to server1 and server2, even though server1 already holds a lot of connections. So the servers' load is never really balanced.)
  3. Setting HAProxy's timeout client or timeout server to 10 minutes. This increased the amount of time our connections could stay idle.
  4. Implementing an IdleConnectionMonitor: With the timeout set to 10m, the chance of an RST from HAProxy was reduced but not eliminated.
    To remove it completely, we added an IdleConnectionMonitor that closes connections which have been idle for more than 9 minutes.
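
As a rough illustration, the HAProxy side of the settings above might look like the fragment below. This is only a sketch: the backend name, server names, and addresses are placeholders, and the timeout connect value and the explicit option http-keep-alive line are assumptions that are not stated in the answer. It also assumes both timeouts were raised to 10 minutes.

```
# Sketch only: names and addresses below are placeholders.
defaults
    mode http
    option http-keep-alive        # assumption: keep-alive mode on both sides
    option prefer-last-server     # item 1: reuse the existing server connection
    timeout connect 5s            # assumption, not taken from the answer
    timeout client  10m           # item 3
    timeout server  10m           # item 3

backend app_servers
    balance leastconn             # item 2
    server app1 192.0.2.11:8080 check
    server app2 192.0.2.12:8080 check
```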


With these configurations, we could

  • Eliminate the Connection Reset
  • Get Connection Pooling working
  • Ensure the load is balanced evenly across servers no matter when they start.
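
The answer does not show its IdleConnectionMonitor, so the following is only a minimal sketch of the idea using the Java standard library, not the actual implementation. The class layout and the method names touch, closeIdle, and start are hypothetical. The pool calls touch whenever it hands out or returns a connection, and a background task closes anything idle for longer than the limit (9 minutes here, just under the 10-minute HAProxy timeout).

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: tracks when each pooled connection was last used
// and closes those that have been idle for longer than maxIdle.
final class IdleConnectionMonitor {
    private final Map<AutoCloseable, Long> lastUsedMillis = new ConcurrentHashMap<>();
    private final long maxIdleMillis;

    IdleConnectionMonitor(long maxIdle, TimeUnit unit) {
        this.maxIdleMillis = unit.toMillis(maxIdle);
    }

    // Record that a connection was just used (or just added to the pool).
    void touch(AutoCloseable conn, long nowMillis) {
        lastUsedMillis.put(conn, nowMillis);
    }

    // Close and forget every connection idle for longer than the limit.
    // Returns the number of connections that were closed.
    int closeIdle(long nowMillis) {
        int closed = 0;
        Iterator<Map.Entry<AutoCloseable, Long>> it = lastUsedMillis.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<AutoCloseable, Long> entry = it.next();
            if (nowMillis - entry.getValue() > maxIdleMillis) {
                try {
                    entry.getKey().close();
                } catch (Exception ignored) {
                    // The connection is being discarded anyway.
                }
                it.remove();
                closed++;
            }
        }
        return closed;
    }

    // Run closeIdle periodically on a daemon thread.
    ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "idle-connection-monitor");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(
            () -> closeIdle(System.currentTimeMillis()), period, period, unit);
        return scheduler;
    }

    public static void main(String[] args) {
        IdleConnectionMonitor monitor = new IdleConnectionMonitor(9, TimeUnit.MINUTES);
        monitor.start(30, TimeUnit.SECONDS);
    }
}
```

A real pool would also need to make sure a connection is not handed out at the same moment the monitor closes it; this sketch leaves that coordination out.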

Hope this helps!!

Kishore Bandi answered Nov 13 '22 06:11


The Tomcat Nio-handler just does:

} catch (java.net.SocketException e) {
    // SocketExceptions are normal
    Http11NioProtocol.log.debug
        (sm.getString
         ("http11protocol.proto.socketexception.debug"), e);
} catch (java.io.IOException e) {
    // IOExceptions are normal
    Http11NioProtocol.log.debug
        (sm.getString
         ("http11protocol.proto.ioexception.debug"), e);
}

So it seems like the initial throw by the internal sun-classes (sun.nio.ch.FileDispatcherImpl) really is inevitable unless you reimplement them yourself.
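
The same "treat it as normal" approach could be applied in your own handler. Below is a minimal sketch using only the Java standard library. The class name ResetClassifier and the method isBenignDisconnect are hypothetical, and matching on the exception message is an assumption, since the message text is platform-dependent. In netty 3 the check would typically be called from an exceptionCaught override in a SimpleChannelUpstreamHandler, logging at debug level and closing the channel when it returns true.

```java
import java.io.IOException;
import java.net.SocketException;
import java.nio.channels.ClosedChannelException;
import java.util.Locale;

// Hypothetical helper: decides whether an exception most likely just
// means that the peer (here: haproxy) dropped the connection.
final class ResetClassifier {

    static boolean isBenignDisconnect(Throwable t) {
        if (t instanceof ClosedChannelException || t instanceof SocketException) {
            return true;
        }
        if (t instanceof IOException && t.getMessage() != null) {
            // Assumption: the message text is platform-dependent.
            String msg = t.getMessage().toLowerCase(Locale.ROOT);
            return msg.contains("connection reset") || msg.contains("broken pipe");
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable reset = new IOException("Connection reset by peer");
        if (isBenignDisconnect(reset)) {
            // In a real handler: log at debug level and close the channel.
            System.out.println("benign disconnect: " + reset.getMessage());
        }
    }
}
```

Anything the helper does not recognise should still be logged as a real error, so genuine bugs are not hidden.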

Benjaminssp answered Nov 13 '22 06:11


Try with

  • option http-tunnel
  • no option redispatch

I'm not sure about the redispatch setting, but http-tunnel fixed the issue on our end.

Rajashekhar S Choukimath answered Nov 13 '22 05:11