I have a (Java) application that runs in a low-latency environment; it typically processes instructions in ~600 micros (+/- 100). Naturally, as we've moved further into the microsecond space, the things you see costing latency have changed, and right now we've noticed that two thirds of that time is spent in the allocation of the two core domain objects.
Benchmarking has isolated the offending sections of code to literally the construction of the objects from existing references, i.e. basically a load of reference assignments (~15 in each class) and a couple of lists new'ed up, though see the note below on exactly what is measured here.
Each one consistently takes ~100 micros, which is inexplicable to me, and I'm trying to track down why. A quick benchmark suggests that a similarly sized object full of strings takes about 2-3 micros to new up; obviously this sort of benchmark is fraught with difficulty, but I thought it might be useful as a baseline.
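For illustration, that baseline benchmark was along these lines (a rough sketch; the Holder class is a stand-in with roughly the right shape, not our real domain object, and the usual micro-benchmark caveats apply):

import java.util.ArrayList;
import java.util.List;

public class AllocationBaseline {
    // Stand-in for a domain object: ~15 reference fields plus a couple of lists.
    static class Holder {
        String a, b, c, d, e, f, g, h, i, j, k, l, m, n, o;
        List<String> list1 = new ArrayList<String>();
        List<String> list2 = new ArrayList<String>();

        Holder(String s) {
            a = s; b = s; c = s; d = s; e = s;
            f = s; g = s; h = s; i = s; j = s;
            k = s; l = s; m = s; n = s; o = s;
        }
    }

    public static void main(String... args) {
        String s = "payload";
        Holder sink = null;
        // Warm up so the constructor is JIT-compiled before timing.
        for (int i = 0; i < 100000; i++) sink = new Holder(s);
        int runs = 1000;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) sink = new Holder(s);
        long avgNanos = (System.nanoTime() - start) / runs;
        System.out.printf("average construction: %,d ns (%b)%n", avgNanos, sink != null);
    }
}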
There are two questions here.
Note: the hardware involved is Solaris 10 x86 on Sun X4600s with 8 dual-core Opterons @ 3.2GHz.
Things we've looked at include
Any & all thoughts appreciated
Since your question was more about how to go about investigating the problem than 'what is my problem', I'll stick to some tools to try out.
A very useful tool for getting a better idea of what is going on, and when, is BTrace. It is similar to DTrace but is a pure Java tool. On that note, I'm assuming you know DTrace; if not, it is also useful, if obtuse. Both will give you some visibility into what is happening, and when, in the JVM and the OS.
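As a minimal sketch of what a BTrace script looks like (this assumes the classic com.sun.btrace API, and com.example.DomainObject is a placeholder for one of your domain classes):

import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;

@BTrace
public class AllocTimer {
    // Fires when the constructor returns; @Duration supplies the elapsed nanoseconds.
    @OnMethod(clazz = "com.example.DomainObject",  // placeholder for your domain class
              method = "<init>",
              location = @Location(Kind.RETURN))
    public static void onNew(@Duration long nanos) {
        println(strcat("construction took ns: ", str(nanos)));
    }
}

You then attach it to the running process with the btrace command, e.g. btrace <pid> AllocTimer.java.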
Oh, another thing to clarify in your original posting: which collector are you running? I'm assuming, given the latency sensitivity, that you are using a low-pause collector like CMS. If so, have you tried any tuning?
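For example, a typical CMS starting point looks something like the following (the values are illustrative starting points rather than recommendations, and YourMainClass is a placeholder); the logging flags will also show you exactly when, and for how long, the application is stopped:

java -XX:+UseConcMarkSweepGC \
     -XX:+UseParNewGC \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -XX:+PrintGCApplicationStoppedTime \
     YourMainClass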
When you repeat the same task many times, your CPU tends to run very efficiently, because cache-miss times and CPU warm-up no longer show up as a factor. It is also possible you are not accounting for your JVM warm-up time either.
If you try the same thing when the JVM and/or the CPU is not warmed up, you will get very different results.
Try doing the same thing, say, 25 times (fewer than your compile threshold) and sleep(100) between tests. You should expect to see much higher times, closer to what you see in the real application.
The behaviour of your app will differ, but to illustrate the point: I have found that waiting for IO can be more disruptive than a plain sleep.
When you are performing your benchmark, you should try to ensure you are comparing like with like.
import java.io.*;
import java.util.Date;

/**
 Cold JVM with a Hot CPU took 123 us average
 Cold JVM with a Cold CPU took 403 us average
 Cold JVM with a Hot CPU took 314 us average
 Cold JVM with a Cold CPU took 510 us average
 Cold JVM with a Hot CPU took 316 us average
 Cold JVM with a Cold CPU took 514 us average
 Cold JVM with a Hot CPU took 315 us average
 Cold JVM with a Cold CPU took 545 us average
 Cold JVM with a Hot CPU took 321 us average
 Cold JVM with a Cold CPU took 542 us average
 Hot JVM with a Hot CPU took 44 us average
 Hot JVM with a Cold CPU took 111 us average
 Hot JVM with a Hot CPU took 32 us average
 Hot JVM with a Cold CPU took 96 us average
 Hot JVM with a Hot CPU took 26 us average
 Hot JVM with a Cold CPU took 80 us average
 Hot JVM with a Hot CPU took 26 us average
 Hot JVM with a Cold CPU took 90 us average
 Hot JVM with a Hot CPU took 25 us average
 Hot JVM with a Cold CPU took 98 us average
 */
public class HotColdBenchmark {
    public static void main(String... args) {
        // Load all the classes.
        performTest(null, 25, false);
        for (int i = 0; i < 5; i++) {
            // Still pretty cold.
            performTest("Cold JVM with a Hot CPU", 25, false);
            // Still pretty cold.
            performTest("Cold JVM with a Cold CPU", 25, true);
        }
        // Warm up the JVM.
        performTest(null, 10000, false);
        for (int i = 0; i < 5; i++) {
            // Warmed up.
            performTest("Hot JVM with a Hot CPU", 25, false);
            // A bit cold.
            performTest("Hot JVM with a Cold CPU", 25, true);
        }
    }

    public static long performTest(String report, int n, boolean sleep) {
        long time = 0;
        long ret = 0;
        for (int i = 0; i < n; i++) {
            long start = System.nanoTime();
            try {
                // Serialize and deserialize a Date as a stand-in workload.
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                ObjectOutputStream oos = new ObjectOutputStream(baos);
                oos.writeObject(new Date());
                oos.close();
                ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(baos.toByteArray()));
                Date d = (Date) ois.readObject();
                ret += d.getTime();
                time += System.nanoTime() - start;
                // Sleeping lets the CPU caches go cold between iterations.
                if (sleep) Thread.sleep(100);
            } catch (Exception e) {
                throw new AssertionError(e);
            }
        }
        if (report != null) {
            System.out.printf("%s took %,d us average%n", report, time / n / 1000);
        }
        return ret;
    }
}
Memory allocation can cause side-effects. Is it possible that the memory allocation is causing the heap to be compacted? Have you checked whether your memory allocation is triggering the GC to run at the same time?
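One way to check that in-process is to snapshot the collection counts around the allocation (a sketch using the standard java.lang.management beans; nothing here is specific to your app):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcDuringAllocation {
    // Total number of collections across all collectors so far.
    static long gcCount() {
        long count = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = bean.getCollectionCount(); // -1 if undefined for this collector
            if (c > 0) count += c;
        }
        return count;
    }

    public static void main(String... args) {
        long before = gcCount();
        Object o = new byte[1024 * 1024]; // stand-in for constructing your domain objects
        long after = gcCount();
        System.out.println("GC ran during the timed section: " + (after > before) + ", " + (o != null));
    }
}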
Have you separately timed how long it takes to create the new ArrayLists?
There is probably no hope of getting microsecond-latency guarantees from a general-purpose VM running on a general-purpose OS, even with such great hardware. Massive throughput is the best you can hope for. How about switching to a real-time VM if you need one? (I'm talking about RTSJ and all that...)
...my two cents