 

Tomcat startup fails due to 'java.net.SocketException Invalid argument' on Mac OS X

We have an application that runs on Tomcat 6 (6.0.35.0 to be precise), and most of our engineers on Mac OS are having problems starting Tomcat due to the socketAccept call in the Catalina.await method throwing a SocketException:

SEVERE: StandardServer.await: accept: java.net.SocketException: Invalid argument
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.PlainSocketImpl.socketAccept(PlainSocketImpl.java)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
    at java.net.ServerSocket.implAccept(ServerSocket.java:522)
    at java.net.ServerSocket.accept(ServerSocket.java:490)
    at org.apache.catalina.core.StandardServer.await(StandardServer.java:431)
    at org.apache.catalina.startup.Catalina.await(Catalina.java:676)
    at org.apache.catalina.startup.Catalina.start(Catalina.java:628)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
    at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
    at mycompany.tomcat.startup.ThreadDumpWrapper.main(ThreadDumpWrapper.java:260)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.tanukisoftware.wrapper.WrapperStartStopApp.run(WrapperStartStopApp.java:238)
    at java.lang.Thread.run(Thread.java:722)

This causes Tomcat to shut down immediately after startup (and no small amount of rage). We think this has been with us for as long as we've been on Mac OS with Java 1.7; in the last several months a lot of us have switched to MacBook Pros. Up until now, the only symptom was occasional zero-byte responses from Tomcat, due to this exception also being thrown on a socketRead. The errors never hit the logs, and we'd individually shrugged them off as an isolated problem; we only found the cause when the startup failures began and I set a SocketException breakpoint:

Daemon Thread [http-8080-1] (Suspended (breakpoint at line 47 in SocketException))
    SocketException.<init>(String) line: 47
    SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available [native method]
    SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available
    SocketInputStream.read(byte[], int, int, int) line: 150
    SocketInputStream.read(byte[], int, int) line: 121
    InternalInputBuffer.fill() line: 735
    InternalInputBuffer.parseRequestLine() line: 366
    Http11Processor.process(Socket) line: 814
    Http11Protocol$Http11ConnectionHandler.process(Socket) line: 602
    JIoEndpoint$Worker.run() line: 489
    Thread.run() line: 722

For arguments:

arg0  FileDescriptor  (id=499)
    fd        1097
    useCount  AtomicInteger  (id=503)
        value 2
arg1  (id=502)
arg2  0
arg3  8192
arg4  20000

The problem is time-sensitive. Increased startup time due to application changes (lots more Spring introspection/singleton overhead) seems to be the factor that causes this to affect Tomcat startup, with the tipping point being about 160 seconds. We can mitigate the problem by disabling some of the non-mandatory contexts we don't need during development to reduce startup time, but I'd prefer to find the root cause.

Application configuration

The specifics of the application are too complex to cover in detail, but I have a hunch that this might relate to an earlier bind, so I'll at least list the listening ports on my machine:

localhost:32000 - Java service wrapper port
*:10001         - RMI registry
*:2322          - Java debug
*:56566         - RMI
*:8180          - Tomcat HTTP connector
*:8543          - Tomcat HTTPS connector
*:2223          - Tomcat internal HTTP connector (used for cross-server requests)
*:14131         - 'Locking' port to determine if an internal service is running
*:56571         - EhCache RMI
*:56573         - RMI
*:62616         - ActiveMQ broker
*:5001          - SOAPMonitorService
*:8109          - Tomcat shutdown port

Items ruled out

  • The most obvious solution: -Djava.net.preferIPv4Stack=true. I've always had that option configured
  • Any recent configuration change to our base application configuration, libraries, JVM options (there aren't any)
  • A JDK regression. I've tested JDK 1.7.0_09, 11, 15, 17 and 21 (the JDKs I've had installed on my machine for the duration)
  • Mac OS update. Mac OS 10.7.x and 10.8.0 through 10.8.3 are affected
  • File descriptor limits - increased from 5000 to 10000
  • Disabling IPv6 completely on the main ethernet interface
  • Setting breakpoints, and removing the first contexts to be affected by the SocketException (they're outgoing HTTP calls to web services). No change
  • Configuring /etc/hosts so the machine hostname resolves to localhost, and configuring JVM options to prefer IPv4 and to not prefer IPv6 addresses (This answer: https://stackoverflow.com/a/16318860/364206)
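For reference, the descriptor limits mentioned above can be inspected from a shell. A minimal sketch (the launchctl command is the Mac OS-specific system-wide knob, shown as a comment since it isn't portable):

```shell
# Per-process file descriptor limits: soft limit, then hard ceiling
ulimit -Sn
ulimit -Hn
# On Mac OS X, the system-wide maxfiles limits can be shown with:
#   launchctl limit maxfiles
```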

For those interested, the hosts configuration is the default. I can reproduce this on a Fusion VM with a clean install of 10.8:

##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1       localhost
255.255.255.255 broadcasthost
::1             localhost
fe80::1%lo0     localhost

Java code investigation

Due to the apparently time-sensitive nature of the issue, setting breakpoints to troubleshoot it prevents it from occurring. As requested in the comments, I also captured arg0 for SocksSocketImpl(PlainSocketImpl).socketAccept(SocketImpl); nothing seems out of the ordinary.

arg0  SocksSocketImpl  (id=460)
    address InetAddress  (id=465)
        canonicalHostName null
        holder  InetAddress$InetAddressHolder  (id=475)
            address  0
            family   0
            hostName null
    applicationSetProxy false
    closePending  false
    cmdIn   null
    cmdOut  null
    cmdsock null
    CONNECTION_NOT_RESET      0
    CONNECTION_RESET          2
    CONNECTION_RESET_PENDING  1
    external_address  null
    fd  FileDescriptor  (id=713)
        fd        -1
        useCount  AtomicInteger  (id=771)
            value 0
    fdLock      Object  (id=714)
    fdUseCount  0
    localport   0
    port        0
    resetLock   Object  (id=716)
    resetState  0
    server      null
    serverPort  1080
    serverSocket null
    shut_rd     false
    shut_wr     false
    socket  Socket  (id=718)
        bound     false
        closed    false
        closeLock Object  (id=848)
        connected false
        created   false
        impl      null
        oldImpl   false
        shutIn    false
        shutOut   false
    socketInputStream null
    stream       false
    timeout      0
    trafficClass 0
    useV4        false

I think all of the threads where the exceptions are thrown are victims of an earlier call, one that doesn't result in a SocketException so I haven't been able to catch it. Being able to start Tomcat by reducing startup times convinces me that the trigger is probably some scheduled task that performs a socket based operation, which then affects other socket operations.

That doesn't explain how or why this could affect several threads, though. No matter what we're doing to cause this condition, mysterious SocketExceptions shouldn't bubble up from native code simultaneously on multiple threads: two threads making outgoing web service calls, the Tomcat await call, and several TP processor threads, repeatedly.

JNI code investigation

Given the generic message, I assumed that an EINVAL error must be returned from one of the system calls in the socketAccept JNI code, so I traced the system calls leading up to the exception; there's no EINVAL returned from any system call. So, I went to the OpenJDK sources looking for conditions in the socketAccept code that would set and then throw an EINVAL, but I also couldn't find any code that sets errno to EINVAL, or calls NET_ThrowByNameWithLastError, NET_ThrowCurrent or NET_ThrowNew in a way that would throw a SocketException with this default error message.

As far as the system calls, we don't seem to get as far as the accept system call:

 PID/THRD        RELATIVE   ELAPSD    CPU SYSCALL(args)    = return
 6606/0x2c750d:  221538243       5      0 sigprocmask(0x1, 0x0, 0x14D8BE100)    = 0x0 0
 6606/0x2c750d:  221538244       3      0 sigaltstack(0x0, 0x14D8BE0F0, 0x0)    = 0 0
 6606/0x2c750d:  221538836      14     10 socket(0x2, 0x1, 0x0)    = 1170 0
 6606/0x2c750d:  221538837       3      0 fcntl(0x492, 0x3, 0x4)    = 2 0
 6606/0x2c750d:  221538839       3      1 fcntl(0x492, 0x4, 0x6)    = 0 0
 6606/0x2c750d:  221538842       5      2 setsockopt(0x492, 0xFFFF, 0x4)    = 0 0
 6606/0x2c750d:  221538852       7      4 bind(0x492, 0x14D8BE5D8, 0x10)    = 0 0
 6606/0x2c750d:  221538857       5      2 listen(0x492, 0x1, 0x4)    = 0 0
 6606/0x2c750d:  221539625       6      2 psynch_cvsignal(0x7FEFBFE00868, 0x10000000200, 0x100)    = 257 0
 6606/0x2c750d:  221539633       4      1 write(0x2, "Apr 18, 2013 11:05:35 AM org.apache.catalina.core.StandardServer await\nSEVERE: StandardServer.await: accept: \njava.net.SocketException: Invalid argument\n\tat java.net.PlainSocketImpl.socketAccept(Native Method)\n\tat java.net.PlainSocketImpl.socketAcce", 0x644)    = 1604 0

So, I think the problem occurs in the timeout handling code at the top of the accept loop in socketAccept, but I couldn't find any case where NET_Timeout would set errno to EINVAL, and result in this SocketException being thrown. I'm referring to this code; I assume the jdk7u branch is for the most part what ships in the Oracle JDK:

  • http://hg.openjdk.java.net/jdk7u/jdk7u/jdk/file/d4bf5c15837c/src/solaris/native/java/net/PlainSocketImpl.c
  • http://hg.openjdk.java.net/jdk7u/jdk7u/jdk/file/d4bf5c15837c/src/solaris/native/java/net/bsd_close.c
  • http://hg.openjdk.java.net/jdk7u/jdk7u/jdk/file/d4bf5c15837c/src/solaris/native/java/net/net_util_md.c

Help!

I can't find anyone in the outside world affected by this particular problem on Mac OS, but almost everyone here is affected. There must be some application configuration that contributes, but I've exhausted every avenue I can think of to find the root cause.

Pointers on troubleshooting or insight on a possible cause would be much appreciated.

asked Apr 24 '13 by Danny Thomas




2 Answers

Have you tried turning on JNI debugging with -Xcheck:jni? Interestingly, the Oracle documentation uses a PlainSocketImpl.socketAccept error as an example of when to use this.
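For Tomcat, the flag can be added through CATALINA_OPTS; a minimal sketch, assuming the stock catalina.sh startup script (which reads that variable):

```shell
# Append the JNI checking flag to Tomcat's JVM options before startup;
# catalina.sh picks CATALINA_OPTS up automatically.
export CATALINA_OPTS="$CATALINA_OPTS -Xcheck:jni"
echo "$CATALINA_OPTS"
```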

Note also that the implication of Bug 7131399 is that the JNI code uses poll() on most platforms but select() on Mac OS, due to a problem with poll() on the Mac. So maybe select() is broken too. Digging in a bit further, select() will return EINVAL if "nfds is greater than FD_SETSIZE and _DARWIN_UNLIMITED_SELECT is not defined." FD_SETSIZE is 1024, and it sounds like you have a ton of applications loading up, so perhaps it all filters down to waiting on more than 1024 FDs at one time.
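To gauge whether a process is anywhere near FD_SETSIZE, you can count its open descriptors. A minimal sketch; /dev/fd reflects the calling process's own descriptor table on both Mac OS X and Linux:

```shell
# Count the current process's open file descriptors; values approaching
# 1024 (the default FD_SETSIZE) put select()-based waits at risk of EINVAL
ls /dev/fd | wc -l
```

To inspect another process (e.g. the Tomcat JVM), `lsof -p <pid> | wc -l` gives a comparable count.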

For extra credit, see if the related (supposedly fixed) Java bug is in fact fixed on your machine. The bug report has pointers to test cases.


Thanks to Old Pro's answer, I confirmed that the select() FD_SETSIZE limitation is the cause. I located an existing bug for this limitation:

https://bugs.openjdk.java.net/browse/JDK-8021820

The problem can be reproduced with the following code:

import java.io.*;
import java.net.*;

public class SelectTest {
    public static void main(String[] args) throws Exception {
        // Use 1024 file descriptors. There'll already be some in use,
        // obviously, but this guarantees the problem will occur
        for (int i = 0; i < 1024; i++) {
            new FileInputStream("/dev/null");
        }
        ServerSocket socket = new ServerSocket(8080);
        socket.accept();
    }
}

Almost a year later, Java 7u60 has a fix for this problem:

http://www.oracle.com/technetwork/java/javase/2col/7u60-bugfixes-2202029.html

I also discovered that Tomcat's WebappClassLoader closes file handles after 90 seconds, which explains why setting breakpoints prevented the issue from occurring.

answered Oct 13 '22 by 9 revs, 2 users 58%


I had exactly the same issue (with Tomcat 7), and what seems to work for me is ticking the "Publish module contexts to separate XML files" option when running Tomcat inside Eclipse. Have you tried that already?

answered Oct 13 '22 by Clement