Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Allocation latency seems high, why?

I have a (java) application that runs in a low latency environment, it typically processes instructions in ~600micros (+/- 100). Naturally as we've moved further into the microsecond space the things you see that cost latency change and right now we've noticed that 2/3 of that time is spent in the allocation of the 2 core domain objects.

Benchmarking has isolated the offending sections of code to literally the construction of the objects from existing references, i.e. basically a load of references (~15 in each class) and a couple of lists new'ed, though see note below on exactly what is measured here.

Each one consistently takes ~100micros which is inexplicable to me and I'm trying to track down why. A quick benchmark suggests a similarly sized object full of strings takes about 2-3micros to new up, obviously this sort of benchmark is fraught with difficulty but thought it might be useful as a baseline.

There are 2 Qs here

  • how does one investigate this sort of behaviour?
  • what explanations are there for slow allocation?

Note the hardware involved is Solaris 10 x86 on Sun X4600s with 8*dual core opterons @ 3.2GHz

Things we've looked at include

  • checking the PrintTLAB stats, shows v few slow allocs so there should be no contention there.
  • PrintCompilation suggests one of these bits of code is not JIT friendly though Solaris seems to have some unusual behaviour here (viz a vis a modern linux, don't have a linux of similar vintage to solaris10 to bench on right now)
  • LogCompilation... bit harder to parse to say the least so that is an ongoing job, nothing obvious so far
  • JVM versions... consistent across 6u6 and 6u14, not tried 6u18 or latest 7 yet

Any & all thoughts appreciated

A summary of the comments on assorted posts to try and make things clearer

  • the cost I'm measuring is the total cost of creating the object which is built via a Builder (like one of these) and whose private constructor invokes new ArrayList a few times as well as setting references to existing objects. The cost measured covers the cost of setting up the builder and the conversion of the builder to a domain object
  • compilation (by hotspot) has a marked impact but it's still relatively slow (compilation in this case takes it from 100micros down to ~60)
  • compilation (by hotspot) on my naive benchmark takes allocation time down from ~2micros to ~300ns
  • latency does not vary with young gen collection algo (ParNew or Parallel scavenge)
like image 960
Matt Avatar asked Nov 17 '09 20:11

Matt


4 Answers

Since your question was more about how to go about investigating the problem rather than 'what is my problem' I'll stick with some tools to try out.

A very useful tool for getting a better idea of what is going on and when is BTrace. It is similar to DTrace however a pure java tool. On that note I'm assuming you know DTrace, if not that is also useful if not obtuse. These will give you some visibility about what is happening and when in the JVM and OS.

Oh, another thing to clarify in your original posting. What collector are you running? I'm assuming with high latency problem that you are using a low pause collector like CMS. If so have you tried any tuning?

like image 146
reccles Avatar answered Nov 20 '22 02:11

reccles


When you are repeating the same task many times, your CPU tends to run very efficiently. This is because your cache miss times and warm up of the CPU doesn't appear as a factor. Its is also possible you are not considering your JVM warm time either.

If you try the same thing when the JVM and/or the CPU is not warmed up. You will get a very different results.

Try doing the same thing say 25 times (Less than your compile threshold) and sleep(100) between tests. You should expect to see much higher times, closer to what you see in the real application.

The behaviour of your app will differ but to illustrate my point. I have found waiting for IO can be more disruptive than a plain sleep.

When you performing your benchmark, you should try to ensure you are comparing like with like.

import java.io.*;
import java.util.Date;

/**
Cold JVM with a Hot CPU took 123 us average
Cold JVM with a Cold CPU took 403 us average
Cold JVM with a Hot CPU took 314 us average
Cold JVM with a Cold CPU took 510 us average
Cold JVM with a Hot CPU took 316 us average
Cold JVM with a Cold CPU took 514 us average
Cold JVM with a Hot CPU took 315 us average
Cold JVM with a Cold CPU took 545 us average
Cold JVM with a Hot CPU took 321 us average
Cold JVM with a Cold CPU took 542 us average
Hot JVM with a Hot CPU took 44 us average
Hot JVM with a Cold CPU took 111 us average
Hot JVM with a Hot CPU took 32 us average
Hot JVM with a Cold CPU took 96 us average
Hot JVM with a Hot CPU took 26 us average
Hot JVM with a Cold CPU took 80 us average
Hot JVM with a Hot CPU took 26 us average
Hot JVM with a Cold CPU took 90 us average
Hot JVM with a Hot CPU took 25 us average
Hot JVM with a Cold CPU took 98 us average
 */
public class HotColdBenchmark {
    public static void main(String... args) {
        // load all the classes.
        performTest(null, 25, false);
        for (int i = 0; i < 5; i++) {
            // still pretty cold
            performTest("Cold JVM with a Hot CPU", 25, false);
            // still pretty cold
            performTest("Cold JVM with a Cold CPU", 25, true);
        }

        // warmup the JVM
        performTest(null, 10000, false);
        for (int i = 0; i < 5; i++) {
            // warmed up.
            performTest("Hot JVM with a Hot CPU", 25, false);
            // bit cold
            performTest("Hot JVM with a Cold CPU", 25, true);
        }
    }

    public static long performTest(String report, int n, boolean sleep) {
        long time = 0;
        long ret = 0;
        for (int i = 0; i < n; i++) {
            long start = System.nanoTime();
            try {
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                ObjectOutputStream oos = new ObjectOutputStream(baos);
                oos.writeObject(new Date());
                oos.close();
                ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(baos.toByteArray()));
                Date d = (Date) ois.readObject();
                ret += d.getTime();
                time += System.nanoTime() - start;
                if (sleep) Thread.sleep(100);
            } catch (Exception e) {
                throw new AssertionError(e);
            }
        }
        if (report != null) {
            System.out.printf("%s took %,d us average%n", report, time / n / 1000);
        }
        return ret;
    }
}
like image 3
Peter Lawrey Avatar answered Nov 20 '22 03:11

Peter Lawrey


Memory allocation can cause side-effects. Is it possible that the memory allocation is causing the heap to get compacted? Have you looked to see if your memory allocation is causing the GC to run at the same time?

Have you separately timed how long it takes to create the new ArrayLists?

like image 2
RickNZ Avatar answered Nov 20 '22 03:11

RickNZ


There is probably no hope in expecting microsecond-latency guarantees from a general-purpose VM running on a general-purpose OS, even with such great hardware. Massive throughput is the best you can hope for. How about switching to a real-time VM if you need one (I'm talking RTSJ and all that...)

...my two cents

like image 2
Gyom Avatar answered Nov 20 '22 03:11

Gyom