I've written the following test code:
@Test
public void testLeakWithGrizzly() throws Throwable {
    ExecutorService executor = Executors.newFixedThreadPool(N_THREADS);
    Set<Future<Void>> futures = new HashSet<>();
    InetSocketAddress inetSocketAddress = new InetSocketAddress(localhostAddress, 111);
    for (int i = 0; i < N_THREADS; i++) {
        Future<Void> future = executor.submit(new GrizzlyConnectTask(inetSocketAddress, requests, bindFailures, successfulOpens, failedOpens, successfulCloses, failedCloses));
        futures.add(future);
    }
    for (Future<Void> future : futures) {
        future.get(); //block
    }
    Thread.sleep(1000); //let everything calm down
    reporter.report();
    throw causeOfDeath;
}
private static class GrizzlyConnectTask implements Callable<Void> {
    private final InetSocketAddress address;
    private final Meter requests;
    private final Meter bindFailures;
    private final Counter successfulOpens;
    private final Counter failedOpens;
    private final Counter successfulCloses;
    private final Counter failedCloses;

    public GrizzlyConnectTask(InetSocketAddress address, Meter requests, Meter bindFailures, Counter successfulOpens, Counter failedOpens, Counter successfulCloses, Counter failedCloses) {
        this.address = address;
        this.requests = requests;
        this.bindFailures = bindFailures;
        this.successfulOpens = successfulOpens;
        this.failedOpens = failedOpens;
        this.successfulCloses = successfulCloses;
        this.failedCloses = failedCloses;
    }

    @Override
    public Void call() throws Exception {
        while (!die) {
            TCPNIOTransport transport = null;
            boolean opened = false;
            try {
                transport = TCPNIOTransportBuilder.newInstance().build();
                transport.start();
                transport.connect(address).get(); //block
                opened = true;
                successfulOpens.inc(); //successful open
                requests.mark();
            } catch (Throwable t) {
                //noinspection ThrowableResultOfMethodCallIgnored
                Throwable root = getRootCause(t);
                if (root instanceof BindException) {
                    bindFailures.mark(); //ephemeral port exhaustion.
                    continue;
                }
                causeOfDeath = t;
                die = true;
            } finally {
                if (!opened) {
                    failedOpens.inc();
                }
                if (transport != null) {
                    try {
                        transport.shutdown().get(); //block
                        successfulCloses.inc(); //successful close
                    } catch (Throwable t) {
                        failedCloses.inc();
                        System.err.println("while trying to close transport");
                        t.printStackTrace();
                    }
                } else {
                    //no transport == successful close
                    successfulCloses.inc();
                }
            }
        }
        return null;
    }
}
On my Linux laptop, this crashes in about 5 minutes with the following exception:
java.io.IOException: Too many open files
at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:130)
at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:68)
at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
at org.glassfish.grizzly.nio.Selectors.newSelector(Selectors.java:62)
at org.glassfish.grizzly.nio.SelectorRunner.create(SelectorRunner.java:109)
at org.glassfish.grizzly.nio.NIOTransport.startSelectorRunners(NIOTransport.java:256)
at org.glassfish.grizzly.nio.NIOTransport.start(NIOTransport.java:475)
at net.radai.LeakTest$GrizzlyConnectTask.call(LeakTest.java:137)
at net.radai.LeakTest$GrizzlyConnectTask.call(LeakTest.java:111)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The success/fail counters look like this:
-- Counters --------------------------------------------------------------------
failedCloses
count = 0
failedOpens
count = 40999
successfulCloses
count = 177177
successfulOpens
count = 136178
-- Meters ----------------------------------------------------------------------
bindFailures
count = 40998
mean rate = 153.10 events/second
1-minute rate = 144.61 events/second
5-minute rate = 91.12 events/second
15-minute rate = 39.56 events/second
requests
count = 136178
mean rate = 508.54 events/second
1-minute rate = 547.38 events/second
5-minute rate = 442.76 events/second
15-minute rate = 391.53 events/second
which tells me that every transport is accounted for: successfulCloses (177177) equals successfulOpens (136178) plus failedOpens (40999), so my code closes everything it opens, yet file descriptors still leak.
The complete code is up on GitHub here: https://github.com/radai-rosenblatt/oncrpc4j-playground/blob/master/src/test/java/net/radai/LeakTest.java
So, am I somehow misusing the Grizzly API, or is this a real leak? (Note: I'm using Grizzly 2.3.12, which I know is not the latest. Upgrading would require convincing people, which is why I want to be positive this is not user error on my end.)
EDIT: this thing leaks even when nothing is thrown. Cutting back to a single thread and putting a 2 ms sleep in there still leaks about 800 pipes over 50 minutes.
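For reference, one way to watch the descriptor count directly while a test like this runs is to poll /proc/self/fd. This is a Linux-only sketch of my own (not part of the original test), shown here against a plain NIO Selector:

```java
import java.io.File;
import java.nio.channels.Selector;

public class FdCount {
    // Count this process's currently open file descriptors.
    // Linux-only: relies on the /proc/self/fd directory.
    static int openFds() {
        File[] fds = new File("/proc/self/fd").listFiles();
        return fds == null ? -1 : fds.length;
    }

    public static void main(String[] args) throws Exception {
        int before = openFds();
        Selector selector = Selector.open(); // allocates epoll + wakeup descriptors
        int during = openFds();
        selector.close();                    // should release them again
        int after = openFds();
        System.out.println("before=" + before + " during=" + during + " after=" + after);
    }
}
```

Sampling openFds() periodically inside the leak test makes the steady climb visible long before "Too many open files" is thrown.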
I found the problem deep inside Grizzly. It is an internal multi-threading issue (a race condition). The file descriptors are leaked by the sun.nio.ch.EPollSelectorImpl class: each instance holds 3 file descriptors (2 for the wakeup pipe and 1 from the epoll_create syscall). Grizzly initiates close/shutdown in the SelectorRunner class:
public synchronized void stop() {
    stateHolder.set(State.STOPPING);
    wakeupSelector();

    // we prefer Selector thread shutdown selector
    // but if it's not running - do that ourselves.
    if (runnerThreadActivityCounter.compareAndSet(0, -1)) {
        // The thread is not running
        shutdownSelector();
    }
}
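The compareAndSet call above is the usual "exactly one party performs the cleanup" handoff idiom: the stopper only shuts the selector down itself if it wins the 0 -> -1 transition. A minimal standalone sketch of that idiom (names are mine, not Grizzly's):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasHandoffDemo {
    public static void main(String[] args) throws Exception {
        // 0 means "no runner thread active"; a stopper claims cleanup by flipping it to -1.
        final AtomicInteger activity = new AtomicInteger(0);
        final AtomicInteger cleanups = new AtomicInteger(0);

        Runnable stopper = () -> {
            // Only one caller can win the 0 -> -1 transition,
            // so the cleanup body runs exactly once even under a race.
            if (activity.compareAndSet(0, -1)) {
                cleanups.incrementAndGet(); // stands in for shutdownSelector()
            }
        };

        Thread t1 = new Thread(stopper);
        Thread t2 = new Thread(stopper);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("cleanups = " + cleanups.get()); // prints "cleanups = 1"
    }
}
```

The idiom itself is sound; the bug described below is in the other branch, where the selector thread is trusted to wake up and do the cleanup.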
Usually everything is fine, but sometimes the selector never wakes up. The wakeup path goes through the native method sun.nio.ch.EPollArrayWrapper#interrupt(int), which has a simple implementation:
JNIEXPORT void JNICALL
Java_sun_nio_ch_EPollArrayWrapper_interrupt(JNIEnv *env, jobject this, int fd)
{
    int fakebuf[1];
    fakebuf[0] = 1;
    if (write(fd, fakebuf, 1) < 0) {
        JNU_ThrowIOExceptionWithLastError(env, "write to interrupt fd failed");
    }
}
So it just writes one byte to wake up the waiting selector. But your test closes the transport immediately after creating it. That is a rare situation in real life, but it happens regularly in your test case. Sometimes Grizzly calls NIOConnection.enableIOEvent after the close and the wakeup/interrupt have already happened; I think in that case the selector never wakes up and never frees its file descriptors.
For now I can only suggest a hotfix for this situation: use a timer task to call selector.close directly after a short timeout:
//hotfix code below
private static final Timer timer = new Timer();
//hotfix code above

protected synchronized void stopSelectorRunners() {
    if (selectorRunners == null) {
        return;
    }
    for (int i = 0; i < selectorRunners.length; i++) {
        SelectorRunner runner = selectorRunners[i];
        if (runner != null) {
            runner.stop();
            //hotfix code below
            final Selector selector = runner.getSelector();
            if (selector != null) {
                timer.schedule(new TimerTask() {
                    @Override
                    public void run() {
                        try {
                            selector.close();
                        } catch (IOException e) {
                        }
                    }
                }, 100);
            }
            //hotfix code above
            selectorRunners[i] = null;
        }
    }
    selectorRunners = null;
}
I was able to stop the leaks by adding this to org.glassfish.grizzly.nio.NIOTransport#stopSelectorRunners.