 

Java NIO selector minimum possible latency

I am doing some benchmarks with an optimized Java NIO selector on Linux over loopback (127.0.0.1).

My test is very simple:

  • One program sends a UDP packet to another program, which echoes it back to the sender, and the round-trip time is computed. The next packet is only sent when the previous one is acknowledged (when it returns). A proper warm-up of a couple of million messages is performed before the benchmark. The message is 13 bytes (not counting UDP headers). A sketch of the sending side is shown below.
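
For reference, this is a minimal sketch of what the sending side of such a test can look like with a blocking selector. The port, payload contents and iteration count are assumptions for illustration, not the exact benchmark code:

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;

    public class UdpRttClient {
        public static void main(String[] args) throws Exception {
            DatagramChannel ch = DatagramChannel.open();
            ch.configureBlocking(false);
            ch.connect(new InetSocketAddress("127.0.0.1", 9999)); // the echo program listens here

            Selector selector = Selector.open();
            ch.register(selector, SelectionKey.OP_READ);

            ByteBuffer msg = ByteBuffer.allocateDirect(13);       // 13-byte payload, reused for every packet
            long[] rtts = new long[1_000_000];

            for (int i = 0; i < rtts.length; i++) {
                msg.clear();
                long start = System.nanoTime();
                ch.write(msg);                    // send the packet

                selector.select();                // block until the echo is readable
                selector.selectedKeys().clear();

                msg.clear();
                ch.read(msg);                     // drain the echoed packet
                rtts[i] = System.nanoTime() - start;
            }
            // warm-up and percentile computation over rtts[] omitted
        }
    }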

For the round trip time I get the following results:

  • Min time: 13 micros
  • Avg time: 19 micros
  • 75th percentile: 18,567 nanos
  • 90th percentile: 18,789 nanos
  • 99th percentile: 19,184 nanos
  • 99.9th percentile: 19,264 nanos
  • 99.99th percentile: 19,310 nanos
  • 99.999th percentile: 19,322 nanos

But the catch here is that I am spinning 1 million messages.

If I spin only 10 messages I get very different results:

  • Min time: 41 micros
  • Avg time: 160 micros
  • 75th percentile: 150,701 nanos
  • 90th percentile: 155,274 nanos
  • 99th percentile: 159,995 nanos
  • 99.9th percentile: 159,995 nanos
  • 99.99th percentile: 159,995 nanos
  • 99.999th percentile: 159,995 nanos

Correct me if I am wrong, but I suspect that once the NIO selector is kept spinning, the response times become optimal. However, if messages are sent with a large enough interval between them, we pay the price of waking up the selector.

If I play around with sending just a single message I get various times between 150 and 250 micros.

So my questions for the community are:

1 - Is my minimum time of 13 micros, with an average of 19 micros, optimal for this round-trip packet test? It looks like I am beating ZeroMQ by far, so I may be missing something here. From this benchmark it looks like ZeroMQ has a 49-micro average time (99th percentile) on a standard kernel => http://www.zeromq.org/results:rt-tests-v031

2 - Is there anything I can do to improve the selector reaction time when I spin a single message or very few messages? 150 micros does not look good. Or should I assume that in a prod environment the selector will never be quiet?


Update: By busy spinning around selectNow() I am able to get better results. Sending a few packets is still worse than sending many packets, but I think I am now hitting the selector's performance limit. A sketch of the spin loop follows the results. My results:

  • Sending a single packet I get a consistent 65 micros round trip time.
  • Sending two packets I get around 39 micros round trip time on average.
  • Sending 10 packets I get around 17 micros round trip time on average.
  • Sending 10,000 packets I get around 10,098 nanos round trip time on average.
  • Sending 1 million packets I get 9,977 nanos round trip time on average.
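
The only change relative to the blocking version is the wait step: instead of parking in select(), the thread spins on selectNow() so it never has to be woken up. A rough sketch of that inner loop, reusing the (assumed) names from the earlier sketch:

    long start = System.nanoTime();
    msg.clear();
    ch.write(msg);                       // send the packet

    // Busy spin: selectNow() returns immediately, ready or not, so the
    // thread stays on-CPU and pays no wake-up penalty (but burns a core).
    while (selector.selectNow() == 0) {
        // spin
    }
    selector.selectedKeys().clear();

    msg.clear();
    ch.read(msg);                        // drain the echoed packet
    long rtt = System.nanoTime() - start;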

Conclusions

  • So it looks like the physical barrier for the UDP packet round trip is an average of 10 microseconds, although I did see some packets making the trip in 8 micros (min time).

  • With busy spinning (thanks Peter) I was able to go from 200 micros on average to a consistent 65 micros on average for a single packet.

  • Not sure why ZeroMQ is 5 times slower than that. (Edit: Maybe because I am testing this on the same machine through loopback and ZeroMQ is using two different machines?)

asked Aug 23 '12 by Julie


1 Answer

You often see cases where waking a thread can be very expensive, not just because it takes time for the thread to wake up, but because the thread runs 2-5x slower for tens of microseconds afterwards while the caches and the rest of the CPU state warm up again.

The way I have avoided this in the past is to busy wait. Unfortunately, selectNow creates a new collection every time you call it, even if it is an empty collection. This generates so much garbage it's not worth using.

One way around this is to busy wait on non-blocking sockets. This doesn't scale particularly well, but it can give you the lowest latency, as the thread doesn't need to wake up and the code you run afterwards is more likely to be in cache. If you use thread affinity as well, it can reduce disturbance of your thread.
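
As an illustration, a busy wait on a non-blocking DatagramChannel needs no Selector at all; read() simply returns 0 until a datagram arrives. A minimal sketch under the same assumptions as above (loopback address, 13-byte payload):

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;

    public class BusyWaitClient {
        public static void main(String[] args) throws Exception {
            DatagramChannel ch = DatagramChannel.open();
            ch.configureBlocking(false);                    // reads return immediately instead of blocking
            ch.connect(new InetSocketAddress("127.0.0.1", 9999));

            ByteBuffer buf = ByteBuffer.allocateDirect(13); // allocated once, reused for every packet

            buf.clear();
            long start = System.nanoTime();
            ch.write(buf);                                  // send the request

            buf.clear();
            while (ch.read(buf) <= 0) {
                // busy wait: the thread never parks, so its code and data stay hot in cache;
                // pinning it to a core (e.g. taskset or an affinity library) reduces disturbance further
            }
            long rtt = System.nanoTime() - start;
            System.out.println("RTT: " + rtt + " ns");
        }
    }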

What I would also suggest is trying to make your code lock-less and garbage-less. If you do this, you can have a process in Java which sends a response to an incoming packet in under 100 microseconds 90% of the time. This would allow you to process each packet at 100 Mb/s as it arrives (packets up to 145 microseconds apart due to bandwidth limitations). For a 1 Gb connection you can get pretty close.
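
For example, the echoing side can be written so that its steady-state loop allocates nothing: one reused direct buffer, and read()/write() on a connected channel instead of receive()/send(), so no address objects are created per packet. A minimal sketch (the port number is an assumption):

    import java.net.InetSocketAddress;
    import java.net.SocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;

    public class GarbageFreeEcho {
        public static void main(String[] args) throws Exception {
            DatagramChannel ch = DatagramChannel.open().bind(new InetSocketAddress(9999));
            ch.configureBlocking(false);

            ByteBuffer buf = ByteBuffer.allocateDirect(64); // one buffer, allocated once

            // Take the first packet via receive() to learn the sender, then connect to it
            // so the hot loop can use read()/write() without creating address objects.
            SocketAddress sender;
            buf.clear();
            while ((sender = ch.receive(buf)) == null) {
                // spin until the first packet arrives
            }
            ch.connect(sender);
            buf.flip();
            ch.write(buf);                                  // echo the first packet

            while (true) {
                buf.clear();
                while (ch.read(buf) <= 0) {
                    // busy wait; nothing is allocated here, so the GC never interferes
                }
                buf.flip();
                ch.write(buf);                              // echo the payload straight back
            }
        }
    }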


If you want fast interprocess communication on the same box in Java, you could consider something like https://github.com/peter-lawrey/Java-Chronicle. This uses shared memory to pass messages with round-trip latencies of less than 200 nanoseconds (which is harder to do efficiently with sockets). It also persists the data, and it is useful if you just want a fast way to produce a journal file.

answered by Peter Lawrey