
Java TCP/IP Socket Performance Problem

Our application reads data very fast over TCP/IP sockets in Java. We are using the NIO library with non-blocking sockets and a Selector to indicate readiness to read. On average, the overall processing time for reading and handling the data is sub-millisecond. However, we frequently see spikes of 10-20 milliseconds (running on Linux).

Using tcpdump we can see the time difference between tcpdump's capture of two discrete messages and compare that with our application's timing. tcpdump shows essentially no delay between the messages, whereas the application can show 20 milliseconds.

We are pretty sure this is not GC, because the GC log shows virtually no Full GC, and in JDK 6 (from what I understand) the default GC is parallel, so it should not be pausing the application threads (unless doing Full GC).

It looks almost as if there is some delay for Java's Selector.select(0) method to return the readiness to read, because at the TCP layer, the data is already available to be read (and tcpdump is reading it).

Additional info: at peak load we are processing about 6,000 messages per second at roughly 150 bytes average per message, or about 900 KB per second.
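
For reference, a minimal sketch of the kind of Selector-driven, non-blocking read loop described above (the host name, port, and buffer size are assumptions, not the actual application code):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    public class NioReadLoop {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            SocketChannel channel = SocketChannel.open();
            channel.configureBlocking(false);
            channel.connect(new InetSocketAddress("feed.example.com", 9000)); // hypothetical endpoint
            channel.register(selector, SelectionKey.OP_CONNECT);

            ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);          // assumed buffer size
            while (true) {
                selector.select(0);                                // block until a channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isConnectable()) {
                        ((SocketChannel) key.channel()).finishConnect();
                        key.interestOps(SelectionKey.OP_READ);     // switch to read interest once connected
                    } else if (key.isReadable()) {
                        buffer.clear();
                        int n = ((SocketChannel) key.channel()).read(buffer);
                        if (n < 0) {
                            return;                                // peer closed the connection
                        }
                        buffer.flip();
                        // decode and handle messages from the buffer here
                    }
                }
            }
        }
    }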

asked Mar 14 '11 by Sam Goldberg


4 Answers

Eden collection still incurs a stop-the-world (STW) pause, so 20 ms may be perfectly normal depending on allocation behaviour and the heap size / size of the live set.
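
If the GC log only records Full GCs, young-generation pauses can go unnoticed; on HotSpot, flags such as -XX:+PrintGCApplicationStoppedTime log every stop-the-world pause. Alternatively, a minimal standalone sketch using the standard GarbageCollectorMXBean (the polling interval is an assumption) to check whether young-collection time is accumulating:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcPauseProbe {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    // Collection count/time include young-generation (eden) collections,
                    // so rising numbers here without any Full GCs still mean STW pauses.
                    System.out.printf("%s: collections=%d totalCollectionTimeMs=%d%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                Thread.sleep(1000); // assumed polling interval
            }
        }
    }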

answered by Matt


Is your Java code running under RTLinux, or some other distro with hard real-time scheduling capability? If not, 10-20 msec of jitter in the processing times seems completely reasonable and expected.

answered by Jim Lewis


I had the same problem in a Java service that I work on. When sending the same request repeatedly from the client, the server would block at the same spot in the stream for 25-35 ms. Turning off Nagle's algorithm on the socket fixed this for me. This can be accomplished by calling setTcpNoDelay(true) on the Socket. This may result in increased network congestion because ACKs will now be sent as separate packets. See http://en.wikipedia.org/wiki/Nagle%27s_algorithm for more info on Nagle's algorithm.
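
With the NIO SocketChannel the question describes, the same option can be set through the channel's underlying socket. A minimal sketch (the endpoint is an assumption):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.SocketChannel;

    public class NoDelayExample {
        public static void main(String[] args) throws IOException {
            // Connect in blocking mode, then switch to non-blocking for the selector loop.
            SocketChannel channel = SocketChannel.open(new InetSocketAddress("feed.example.com", 9000)); // hypothetical endpoint
            channel.configureBlocking(false);
            // Disable Nagle's algorithm: small writes are sent immediately instead of
            // being buffered while waiting for an ACK of previously sent data.
            channel.socket().setTcpNoDelay(true);
        }
    }

On JDK 7 and later the equivalent is channel.setOption(StandardSocketOptions.TCP_NODELAY, true).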

answered by Adam Henderson


From the tcpdump FAQ:

WHEN IS A PACKET TIME-STAMPED? HOW ACCURATE ARE THE TIME STAMPS?

In most OSes on which tcpdump and libpcap run, the packet is time stamped as part of the process of the network interface's device driver, or the networking stack, handling it. This means that the packet is not time stamped at the instant that it arrives at the network interface; after the packet arrives at the network interface, there will be a delay until an interrupt is delivered or the network interface is polled (i.e., the network interface might not interrupt the host immediately - the driver may be set up to poll the interface if network traffic is heavy, to reduce the number of interrupts and process more packets per interrupt), and there will be a further delay between the point at which the interrupt starts being processed and the time stamp is generated.

So odds are the timestamp is made in the privileged kernel layer, and the lost 20 ms is due to context-switching overhead back to user space and into Java and the JVM's network selector logic. Without more analysis of the system as a whole, I don't think it's possible to pin down the cause definitively.
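
One way to narrow it down is to timestamp each stage of the wake-up inside the application, so the 20 ms can be attributed to time spent blocked in select(), the read() itself, or the handling code. This cannot see the kernel-to-user handoff directly, but it shows where the application-side time goes. A sketch under those assumptions (handle() is a placeholder, not a real method in the application):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    final class SelectLoopTimer {
        void run(Selector selector, ByteBuffer buffer) throws IOException {
            while (true) {
                long beforeSelect = System.nanoTime();
                selector.select(0);                              // wait for readiness
                long afterSelect = System.nanoTime();
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (!key.isReadable()) {
                        continue;
                    }
                    buffer.clear();
                    ((SocketChannel) key.channel()).read(buffer);
                    long afterRead = System.nanoTime();
                    buffer.flip();
                    handle(buffer);                              // placeholder for application decoding
                    long afterHandle = System.nanoTime();
                    System.out.printf("select=%dus read=%dus handle=%dus%n",
                            (afterSelect - beforeSelect) / 1000,
                            (afterRead - afterSelect) / 1000,
                            (afterHandle - afterRead) / 1000);
                }
            }
        }

        void handle(ByteBuffer buffer) {
            // application-specific message handling would go here
        }
    }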

answered by Chris K