Grizzly pipe leak - what am I doing wrong?

Tags: java, grizzly

I've written the following test code:

@Test
public void testLeakWithGrizzly() throws Throwable {
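    // note: N_THREADS, localhostAddress, die, causeOfDeath and the metric
    // objects (requests, bindFailures, the counters, reporter) are fields
    // of the enclosing test class - see the full source linked below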
    ExecutorService executor = Executors.newFixedThreadPool(N_THREADS);
    Set<Future<Void>> futures = new HashSet<>();
    InetSocketAddress inetSocketAddress = new InetSocketAddress(localhostAddress, 111);
    for (int i = 0; i < N_THREADS; i++) {
        Future<Void> future = executor.submit(new GrizzlyConnectTask(inetSocketAddress, requests, bindFailures, successfulOpens, failedOpens, successfulCloses, failedCloses));
        futures.add(future);
    }
    for (Future<Void> future : futures) {
        future.get(); //block
    }
    Thread.sleep(1000); //let everything calm down
    reporter.report();
    throw causeOfDeath;
}

private static class GrizzlyConnectTask implements Callable<Void> {
    private final InetSocketAddress address;
    private final Meter requests;
    private final Meter bindFailures;
    private final Counter successfulOpens;
    private final Counter failedOpens;
    private final Counter successfulCloses;
    private final Counter failedCloses;

    public GrizzlyConnectTask(InetSocketAddress address, Meter requests, Meter bindFailures, Counter successfulOpens, Counter failedOpens, Counter successfulCloses, Counter failedCloses) {
        this.address = address;
        this.requests = requests;
        this.bindFailures = bindFailures;
        this.successfulOpens = successfulOpens;
        this.failedOpens = failedOpens;
        this.successfulCloses = successfulCloses;
        this.failedCloses = failedCloses;
    }

    @Override
    public Void call() throws Exception {
        while (!die) {
            TCPNIOTransport transport = null;
            boolean opened = false;
            try {
                transport = TCPNIOTransportBuilder.newInstance().build();
                transport.start();
                transport.connect(address).get(); //block
                opened = true;
                successfulOpens.inc(); //successful open
                requests.mark();
            } catch (Throwable t) {
                //noinspection ThrowableResultOfMethodCallIgnored
                Throwable root = getRootCause(t);
                if (root instanceof BindException) {
                    bindFailures.mark(); //ephemeral port exhaustion.
                    continue;
                }
                causeOfDeath = t;
                die = true;
            } finally {
                if (!opened) {
                    failedOpens.inc();
                }
                if (transport != null) {
                    try {
                        transport.shutdown().get(); //block
                        successfulCloses.inc(); //successful close
                    } catch (Throwable t) {
                        failedCloses.inc();
                        System.err.println("while trying to close transport");
                        t.printStackTrace();
                    }
                } else {
                    //no transport == successful close
                    successfulCloses.inc();
                }
            }
        }
        return null;
    }
}

On my Linux laptop, this crashes in ~5 minutes with the following exception:

java.io.IOException: Too many open files
    at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
    at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:130)
    at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:68)
    at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
    at org.glassfish.grizzly.nio.Selectors.newSelector(Selectors.java:62)
    at org.glassfish.grizzly.nio.SelectorRunner.create(SelectorRunner.java:109)
    at org.glassfish.grizzly.nio.NIOTransport.startSelectorRunners(NIOTransport.java:256)
    at org.glassfish.grizzly.nio.NIOTransport.start(NIOTransport.java:475)
    at net.radai.LeakTest$GrizzlyConnectTask.call(LeakTest.java:137)
    at net.radai.LeakTest$GrizzlyConnectTask.call(LeakTest.java:111)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

The success/fail counters look like this:

-- Counters --------------------------------------------------------------------
failedCloses
             count = 0
failedOpens
             count = 40999
successfulCloses
             count = 177177
successfulOpens
             count = 136178

-- Meters ----------------------------------------------------------------------
bindFailures
             count = 40998
         mean rate = 153.10 events/second
     1-minute rate = 144.61 events/second
     5-minute rate = 91.12 events/second
    15-minute rate = 39.56 events/second
requests
             count = 136178
         mean rate = 508.54 events/second
     1-minute rate = 547.38 events/second
     5-minute rate = 442.76 events/second
    15-minute rate = 391.53 events/second

This tells me that:

  • there were no close failures
  • all connections either failed to create or were successfully closed (136178 + 40999 = 177177)
  • all failures to open were ephemeral port exhaustion except for the last one (40999 = 40998 + 1)

The complete code is up on GitHub here: https://github.com/radai-rosenblatt/oncrpc4j-playground/blob/master/src/test/java/net/radai/LeakTest.java

So, am I somehow misusing the Grizzly API, or is this a real leak? (Note: I'm using Grizzly 2.3.12, which I know is not the latest. Upgrading would require convincing people, which is why I want to be positive this is not user error on my end.)

EDIT - this thing leaks even when nothing is thrown. Cutting back to a single thread and putting a 2 ms sleep in there still leaks 800 pipes over 50 minutes.
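
(For anyone who wants to reproduce the measurement: on Linux the pipe count can be watched from inside the JVM by listing /proc/self/fd. The helper below is illustrative and not part of the linked test; it assumes a mounted /proc filesystem.)

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public final class FdWatcher {
    // counts the open descriptors of this process that resolve to pipes (Linux only)
    public static long countOpenPipes() throws IOException {
        File[] fds = new File("/proc/self/fd").listFiles();
        if (fds == null) {
            throw new IOException("/proc/self/fd not readable - not Linux?");
        }
        long pipes = 0;
        for (File fd : fds) {
            try {
                Path target = Files.readSymbolicLink(fd.toPath());
                if (target.toString().startsWith("pipe:")) { // entries look like "pipe:[123456]"
                    pipes++;
                }
            } catch (IOException ignored) {
                // the fd may have been closed between listing and readlink
            }
        }
        return pipes;
    }
}

Sampling countOpenPipes() once in a while during the run makes the growth visible without external tools like lsof.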

asked Aug 15 '15 by radai

1 Answer

I found the problem deep inside Grizzly. It is an internal multi-threading issue (a race condition). The file descriptors are leaked by instances of sun.nio.ch.EPollSelectorImpl: each instance holds 3 file descriptors (2 for its internal wakeup pipe and 1 from the epoll_create syscall; a quick way to confirm this cost is sketched after the snippet below). Grizzly initiates the close/shutdown in the SelectorRunner class:

    public synchronized void stop() {
        stateHolder.set(State.STOPPING);
        wakeupSelector();

        // we prefer Selector thread shutdown selector
        // but if it's not running - do that ourselves.
        if (runnerThreadActivityCounter.compareAndSet(0, -1)) {
             // The thread is not running
            shutdownSelector();
        }
    }
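
The per-selector descriptor cost mentioned above is easy to confirm with plain NIO, outside of Grizzly (a quick sketch, assuming Linux and the JDK 8 epoll-based selector; other selector implementations may use a different number of descriptors):

import java.io.File;
import java.nio.channels.Selector;

public class SelectorFdCost {
    public static void main(String[] args) throws Exception {
        int before = new File("/proc/self/fd").list().length;
        Selector selector = Selector.open(); // allocates the epoll fd plus a wakeup pipe pair
        int after = new File("/proc/self/fd").list().length;
        System.out.println("fds per selector: " + (after - before)); // prints 3 on JDK 8 / Linux
        selector.close(); // releases all three descriptors
    }
}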

Usually everything is fine, but sometimes the selector never wakes up. The wakeup call goes through the native method sun.nio.ch.EPollArrayWrapper#interrupt(int), which has a simple implementation:

JNIEXPORT void JNICALL
Java_sun_nio_ch_EPollArrayWrapper_interrupt(JNIEnv *env, jobject this, int fd)
{
    int fakebuf[1];
    fakebuf[0] = 1;
    if (write(fd, fakebuf, 1) < 0) {
        JNU_ThrowIOExceptionWithLastError(env,"write to interrupt fd failed");
    }
}

So it just writes one byte to wake up the waiting selector. But you close the transport immediately after creating it; that is a rare situation in real life, but it happens regularly in your test case. Sometimes Grizzly calls NIOConnection.enableIOEvent after the close and the wakeup/interrupt have already happened. I think in that case the selector never wakes up and never frees its file descriptors.
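
The wakeup path itself can be exercised from the Java side without Grizzly (a minimal sketch with plain NIO; on the Linux epoll selector, wakeup() ends up in the one-byte write shown above):

import java.nio.channels.Selector;

public class WakeupDemo {
    public static void main(String[] args) throws Exception {
        final Selector selector = Selector.open();
        Thread blocker = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    selector.select(); // blocks until an event or a wakeup
                    System.out.println("selector woke up");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        blocker.start();
        Thread.sleep(100); // give the thread time to enter select()
        selector.wakeup(); // wakes the blocked select() call
        blocker.join();
        selector.close();
    }
}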

For now I can only suggest a hotfix for this situation: use a timer task to call selector.close directly after a short timeout:

//hotfix code below
private static final Timer timer = new Timer();
//hotfix code above
protected synchronized void stopSelectorRunners() {
    if (selectorRunners == null) {
        return;
    }

    for (int i = 0; i < selectorRunners.length; i++) {
        SelectorRunner runner = selectorRunners[i];
        if (runner != null) {
            runner.stop();
            //hotfix code below
            final Selector selector = runner.getSelector();
            if(selector !=null) {
                timer.schedule(new TimerTask() {
                    @Override
                    public void run() {
                        try {
                            selector.close();
                        } catch (IOException e) {
                            // best-effort close during shutdown; nothing useful to do here
                        }
                    }
                }, 100);
            }
            //hotfix code above
            selectorRunners[i] = null;
        }
    }

    selectorRunners = null;
}

I was able to stop the leaks after adding this to org.glassfish.grizzly.nio.NIOTransport#stopSelectorRunners.
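
To check that the fix holds, watch the process's descriptor count while the test runs (for example ls /proc/<pid>/fd | wc -l, or lsof -p <pid> | grep -c pipe): without the hotfix the pipe count climbs steadily; with it the count stays flat.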

answered by sibnick