Over the course of developing a fairly large project, we've accumulated a lot of unit tests. Many of these tests start a server, connect a client to it, and then close both the client and the server, usually all in the same process.
However, these tests randomly fail with a "Failed to bind address 127.0.0.1:(port)". When the test is re-run, the error usually disappears.
We initially thought this was a problem with our tests, so we wrote a small standalone test in Clojure, which I'll post below (commented for the non-Clojure people).
(ns test
  (:import [java.net Socket ServerSocket]))

(dotimes [n 10000]                         ; Run the test ten thousand times
  (let [server (ServerSocket. 10000)       ; Start a server on port 10000
        client (Socket. "localhost" 10000) ; Start a client on port 10000
        p      (.getLocalPort client)]     ; Get the local port of the client
    (.close client)                        ; Close the client
    (.close server)                        ; Close the server
    (println "n = " n)                     ; Debug
    (println "p = " p)                     ; Debug
    (println "client = " client)           ; Debug
    (println "server = " server)           ; Debug
    (let [server (ServerSocket. p)]        ; Start a server on the local port of the client we just closed
      (.close server)                      ; Close the server
      (println "client = " client)         ; Debug
      (println "server = " server))))      ; Debug
The exception appears, at random, on the line where we start the second server. It appears that Java is holding onto the local port - even though the client on that port has already been closed.
So, my question: Why on earth is Java doing this, and why is it so seemingly random?
EDIT: Someone suggested I set the socket's SO_REUSEADDR option (setReuseAddress) to true. I've done this and nothing has changed, so here's the updated code below.
(ns test
  (:import [java.net Socket ServerSocket InetSocketAddress]))

(dotimes [n 10000]                             ; Run the test ten thousand times
  (let [server (ServerSocket.)]                ; Create a server socket
    (.setReuseAddress server true)             ; Set the socket to reuse address
    (.bind server (InetSocketAddress. 10000))  ; Bind the socket
    (let [client (Socket. "localhost" 10000)   ; Start a client on port 10000
          p      (.getLocalPort client)]       ; Get the client's local port
      (.close client)                          ; Close the client
      (.close server)                          ; Close the server
      ; (Thread/sleep 1000)                    ; A sleep for testing
      (println "n = " n)                       ; Debug
      (println "p = " p)                       ; Debug
      (println "client = " client)             ; Debug
      (println "server = " server)             ; Debug
      (let [server (ServerSocket.)]            ; Create a server socket
        (.setReuseAddress server true)         ; Set the socket to reuse address
        (.bind server (InetSocketAddress. p))  ; Bind the socket to the local port of the client we just had
        (.close server)                        ; Close the server
        (println "client = " client)           ; Debug
        (println "server = " server)))))       ; Debug
I've also noticed that a sleep of 10msec or even 100msec does not prevent the problem. 1000msec has (so far) managed to prevent it, however.
EDIT 2: Someone put me on to SO_LINGER, but I can't find a way to set that on the ServerSockets. Anyone have any ideas on that?
EDIT 3: Turns out that SO_LINGER is disabled by default. What else can we look at?
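For reference, SO_LINGER appears to be a per-connection option on Socket rather than ServerSocket, so it would be checked and set on the client side. A rough sketch (the test.linger namespace is just for illustration, and it reuses port 10000 from the test above, assuming it is free):

(ns test.linger
  (:import [java.net Socket ServerSocket]))

(let [server (ServerSocket. 10000)
      client (Socket. "localhost" 10000)]
  (println "SO_LINGER =" (.getSoLinger client)) ; -1 means disabled (the default)
  (.setSoLinger client true 0)                  ; enable it with a 0-second linger,
  (.close client)                               ; which makes close() abortive (RST) on most stacks
  (.close server))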
UPDATE: The problem has been solved for the most part, using dynamic port allocation over a range of 10,000 or so ports. However, I'd still like to see what people can come up with.
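For reference, the allocation is roughly like the sketch below; the base port, range size, and retry cap are illustrative rather than our actual values:

(ns test.ports
  (:import [java.net ServerSocket BindException]))

(defn bind-in-range
  "Try random ports in [base, base + span) until one binds."
  [base span max-tries]
  (loop [tries 0]
    (when (>= tries max-tries)
      (throw (IllegalStateException. "No free port found")))
    (let [port (+ base (rand-int span))]
      (or (try
            (ServerSocket. port)         ; succeeds only if the port is currently free
            (catch BindException _ nil)) ; in use or in TIME_WAIT, so try another
          (recur (inc tries))))))

; Each test grabs its own server socket on a fresh port instead of hard-coding 10000.
(let [server (bind-in-range 20000 10000 50)]
  (println "bound to port" (.getLocalPort server))
  (.close server))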
I'm not (too) familiar with the Clojure syntax, but you should invoke socket.setReuseAddress(true). This allows the program to reuse the port, even if there may be sockets in the TIME_WAIT state.
The test itself is invalid. Testing this behaviour is pointless, and has nothing to do with any required application behaviour: it is just exercising a corner condition in the TCP stack, which certainly no application should try to rely on. I would expect that opening a listening socket on a port that had just been an outbound connected port would never succeed at all due to TIME_WAIT, or at best succeed half the time due to uncertainty as to which end issued the close first.
I would remove the test. The rest of it doesn't do anything useful either.