
java.io.IOException: error=11

I am experiencing a weird problem with the Java ProcessBuilder. The code is shown below (in a slightly simplified form):

public class Whatever implements Runnable
{

    public void run() {
        // someIdentifier is a randomly generated string
        String in = someIdentifier + "input.txt";
        String out = someIdentifier + "output.txt";
        ProcessBuilder builder = new ProcessBuilder("./whatever.sh", in, out);
        try {
            Process process = builder.start();
            process.waitFor();
        } catch (IOException e) {
            log.error("Could not launch process. Command: " + builder.command(), e);
        } catch (InterruptedException ex) {
            log.error(ex);
        }
    }

}

whatever.sh reads:

R --slave --args $1 $2 <whatever1.R >> r.log    

Loads of instances of Whatever are submitted to an ExecutorService of fixed size (35). The rest of the application waits for all of them to finish, which is implemented with a CountDownLatch. Everything runs fine for several hours (Scientific Linux 5.0, java version "1.6.0_24") before throwing the following exception:

java.io.IOException: Cannot run program "./whatever.sh": java.io.IOException: error=11, Resource temporarily unavailable
    at java.lang.ProcessBuilder.start(Unknown Source)
... rest of stack trace omitted...

Does anyone have an idea what this means? Judging by the Google/Bing search results for java.io.IOException: error=11, it is not the most common of exceptions, and I am completely baffled.

My wild and not so educated guess is that I have too many threads trying to launch the same file at the same time. However, it takes hours of CPU time to reproduce the problem, so I have not tried with a smaller number.

Any suggestions are greatly appreciated.

mbatchkarov asked Dec 05 '11 10:12


1 Answer

The error=11 is almost certainly the EAGAIN error code:

$ grep EAGAIN asm-generic/errno-base.h 
#define EAGAIN      11  /* Try again */
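
If you want the logging code to flag this case on its own, here is a minimal sketch that pulls the number out of the exception text and compares it with 11. It assumes the "error=<number>" fragment shown in the question's stack trace; that message format is not a documented API, and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ErrnoFromMessage {
    // Matches the "error=<number>" fragment seen in the question's exception text.
    // Assumption: the JVM keeps using this wording; it is not guaranteed.
    private static final Pattern ERRNO = Pattern.compile("error=(\\d{1,5})");

    /** Returns the errno embedded in the message, or -1 if none is found. */
    public static int errnoOf(IOException e) {
        String msg = e.getMessage();
        if (msg == null) {
            return -1;
        }
        Matcher m = ERRNO.matcher(msg);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** True when the failure looks like EAGAIN, which is 11 on Linux. */
    public static boolean isEagain(IOException e) {
        return errnoOf(e) == 11;
    }

    public static void main(String[] args) {
        IOException sample = new IOException(
                "Cannot run program \"./whatever.sh\": java.io.IOException: "
                + "error=11, Resource temporarily unavailable");
        System.out.println(errnoOf(sample) + " " + isEagain(sample)); // prints "11 true"
    }
}
```

In the catch block from the question you could call isEagain(e) and log the current process count next to the stack trace, which makes the failure easier to correlate with the ps output later.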

The clone(2) system call documents an EAGAIN error return:

   EAGAIN Too many processes are already running.

The fork(2) system call documents two EAGAIN error returns:

   EAGAIN fork() cannot allocate sufficient memory to copy the
          parent's page tables and allocate a task structure for
          the child.

   EAGAIN It was not possible to create a new process because
          the caller's RLIMIT_NPROC resource limit was
          encountered.  To exceed this limit, the process must
          have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE
          capability.

If you were really that low on memory, it would almost certainly show in the system logs. Check dmesg(1) output or /var/log/syslog for any potential messages about low system memory. (Other things would break. This doesn't seem too plausible.)

Much more likely is running into either the per-user limit on processes or the system-wide maximum number of processes. Perhaps one of your processes isn't properly reaping zombies? This would be very easy to spot by checking ps(1) output over time:

while true ; do ps auxw >> ~/processes ; sleep 10 ; done

(Maybe check every minute or ten minutes if it really does take hours before you're in trouble.)

If you're not reaping zombies, then read up on whatever you must do to ProcessBuilder to use waitpid(2) to reap your dead children.
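
For reference, a minimal sketch of launching a child from Java and waiting for it so that its exit status is collected. The code in the question already calls process.waitFor(); this version additionally drains the child's output and cleans up in a finally block. The class and method names are hypothetical, and the main method assumes a POSIX sh is on the PATH.

```java
import java.io.IOException;
import java.io.InputStream;

class ReapingRunner {
    /**
     * Starts the command, discards its output, and blocks until it exits.
     * Returns the child's exit code.
     */
    public static int runAndReap(String... command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout: one pipe to drain
        Process process = builder.start();
        try {
            process.getOutputStream().close(); // the child sees EOF on its stdin
            InputStream out = process.getInputStream();
            byte[] buf = new byte[8192];
            while (out.read(buf) != -1) {
                // Discard the output; a full, undrained pipe can block the child.
            }
            return process.waitFor(); // blocks until exit and returns the status
        } finally {
            // Kills the child if we got here through an exception;
            // harmless if it has already exited.
            process.destroy();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAndReap("sh", "-c", "echo hello; exit 3")); // prints 3
    }
}
```

If the ps log still shows defunct entries piling up with this in place, the zombies probably belong to something other than the Java parent, for example a script that backgrounds its own children.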

If you're legitimately running more processes than your rlimits allow, you'll need to use ulimit in your bash(1) scripts (if running as root) or set higher limits in /etc/security/limits.conf for the nproc property.
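
To see which limit the JVM is actually running under, you can read /proc/self/limits from inside the application. The sketch below is only an illustration: it assumes the usual Linux layout of that file (a "Max processes" row followed by soft limit, hard limit, and units), uses java.nio.file, which is newer than the Java 6 in the question, and the class name is made up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

class NprocLimit {
    private static final String ROW = "Max processes";

    /**
     * Finds the "Max processes" row and returns its soft limit.
     * Returns -1 if the row is missing or empty, Long.MAX_VALUE for "unlimited".
     */
    public static long softLimit(List<String> limitsLines) {
        for (String line : limitsLines) {
            if (line.startsWith(ROW)) {
                // Remaining columns: soft limit, hard limit, units
                String[] cols = line.substring(ROW.length()).trim().split("\\s+");
                if (cols[0].isEmpty()) {
                    return -1;
                }
                return "unlimited".equals(cols[0])
                        ? Long.MAX_VALUE
                        : Long.parseLong(cols[0]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Linux only: limits of the current process (here, the JVM itself).
        List<String> lines = Files.readAllLines(
                Paths.get("/proc/self/limits"), StandardCharsets.UTF_8);
        System.out.println("nproc soft limit: " + softLimit(lines));
    }
}
```

Keep in mind that on Linux RLIMIT_NPROC counts threads as well as processes for the user, so a JVM with many threads uses up part of the same budget as the child processes it launches.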

If you are instead running into the system-wide process limits, you might need to write a larger value into /proc/sys/kernel/pid_max. See proc(5) for some (short) details.

sarnold answered Oct 05 '22 03:10