
Ways to improve performance consistency

In the following example, one thread sends "messages" via a ByteBuffer and a consumer thread takes them. The best performance is very good, but it's not consistent.

import sun.misc.Unsafe;

import java.io.IOException;
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

public class Main {
    public static void main(String... args) throws IOException {
        for (int i = 0; i < 10; i++)
            doTest();
    }

    public static void doTest() {
        final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64 * 1024);
        final ByteBuffer readBuffer = writeBuffer.slice();
        final AtomicInteger readCount = new PaddedAtomicInteger();
        final AtomicInteger writeCount = new PaddedAtomicInteger();

        for (int i = 0; i < 3; i++)
            performTiming(writeBuffer, readBuffer, readCount, writeCount);
        System.out.println();
    }

    private static void performTiming(ByteBuffer writeBuffer, final ByteBuffer readBuffer, final AtomicInteger readCount, final AtomicInteger writeCount) {
        writeBuffer.clear();
        readBuffer.clear();
        readCount.set(0);
        writeCount.set(0);

        Thread t = new Thread(new Runnable() {
            @Override
            public void run() {
                byte[] bytes = new byte[128];
                while (!Thread.interrupted()) {
                    int rc = readCount.get(), toRead;
                    while ((toRead = writeCount.get() - rc) <= 0) ;
                    for (int i = 0; i < toRead; i++) {
                        byte len = readBuffer.get();
                        if (len == -1) {
                            // rewind.
                            readBuffer.clear();
//                            rc++;
                        } else {
                            int num = readBuffer.getInt();
                            if (num != rc)
                                throw new AssertionError("Expected " + rc + " but got " + num);
                            rc++;
                            readBuffer.get(bytes, 0, len - 4);
                        }
                    }
                    readCount.lazySet(rc);
                }
            }
        });
        t.setDaemon(true);
        t.start();
        Thread.yield();
        long start = System.nanoTime();
        int runs = 30 * 1000 * 1000;
        int len = 32;
        byte[] bytes = new byte[len - 4];
        int wc = writeCount.get();
        for (int i = 0; i < runs; i++) {
            if (writeBuffer.remaining() < len + 1) {
                // reader has to catch up.
                while (wc - readCount.get() > 0) ;
                // rewind.
                writeBuffer.put((byte) -1);
                writeBuffer.clear();
            }
            writeBuffer.put((byte) len);
            writeBuffer.putInt(i);
            writeBuffer.put(bytes);
            writeCount.lazySet(++wc);
        }
        // reader has to catch up.
        while (wc - readCount.get() > 0) ;
        t.interrupt();
        t.stop();
        long time = System.nanoTime() - start;
        System.out.printf("Message rate was %.1f M/s offsets %d %d %d%n", runs * 1e3 / time
                , addressOf(readBuffer) - addressOf(writeBuffer)
                , addressOf(readCount) - addressOf(writeBuffer)
                , addressOf(writeCount) - addressOf(writeBuffer)
        );
    }

    // assumes -XX:+UseCompressedOops.
    public static long addressOf(Object... o) {
        long offset = UNSAFE.arrayBaseOffset(o.getClass());
        return UNSAFE.getInt(o, offset) * 8L;
    }

    public static final Unsafe UNSAFE = getUnsafe();

    public static Unsafe getUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (Exception e) {
            throw new AssertionError(e);
        }
    }

    private static class PaddedAtomicInteger extends AtomicInteger {
        public long p2, p3, p4, p5, p6, p7;

        public long sum() {
//            return 0;
            return p2 + p3 + p4 + p5 + p6 + p7;
        }
    }
}

The code prints the timings for the same block of data. The numbers at the end are the relative addresses of the objects, which show they are laid out in cache the same way each time. Running longer tests of 10 shows that a given combination produces the same performance repeatedly.

Message rate was 63.2 M/s offsets 136 200 264
Message rate was 80.4 M/s offsets 136 200 264
Message rate was 80.0 M/s offsets 136 200 264

Message rate was 81.9 M/s offsets 136 200 264
Message rate was 82.2 M/s offsets 136 200 264
Message rate was 82.5 M/s offsets 136 200 264

Message rate was 79.1 M/s offsets 136 200 264
Message rate was 82.4 M/s offsets 136 200 264
Message rate was 82.4 M/s offsets 136 200 264

Message rate was 34.7 M/s offsets 136 200 264
Message rate was 39.1 M/s offsets 136 200 264
Message rate was 39.0 M/s offsets 136 200 264

Each set of buffers and counters is tested three times, and a given set of buffers appears to give similar results each time. So I believe there is something about the way these buffers are laid out in memory that I am not seeing.

Is there anything that might give the higher performance more often? It looks like a cache collision, but I can't see where this could be happening.

BTW: M/s is millions of messages per second and is more than anyone is likely to need, but it would be good to understand how to make it consistently fast.


EDIT: Using synchronized with wait() and notify() makes the result much more consistent, but not faster.

Message rate was 6.9 M/s
Message rate was 7.8 M/s
Message rate was 7.9 M/s
Message rate was 6.7 M/s
Message rate was 7.5 M/s
Message rate was 7.7 M/s
Message rate was 7.3 M/s
Message rate was 7.9 M/s
Message rate was 6.4 M/s
Message rate was 7.8 M/s
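
For reference, the kind of change that edit describes might look roughly like the sketch below. This is only an illustration of the synchronized/wait/notify pattern under my own assumptions (the class and method names are hypothetical), not the code that was actually benchmarked.

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the reader blocks in wait() instead of busy-spinning,
// and the writer calls notifyAll() after publishing a new write count.
public class WaitNotifySketch {
    private final Object lock = new Object();
    private final AtomicInteger writeCount = new AtomicInteger();

    // writer side: publish one more message and wake any waiting reader
    public void publish() {
        synchronized (lock) {
            writeCount.incrementAndGet();
            lock.notifyAll();
        }
    }

    // reader side: block until there is something beyond 'readSoFar' to read
    public int awaitMore(int readSoFar) throws InterruptedException {
        synchronized (lock) {
            while (writeCount.get() - readSoFar <= 0)
                lock.wait();
            return writeCount.get();
        }
    }
}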

EDIT: Using taskset, I can make the performance consistent if I lock the two threads to share the same core.

Message rate was 35.1 M/s offsets 136 200 216
Message rate was 34.0 M/s offsets 136 200 216
Message rate was 35.4 M/s offsets 136 200 216

Message rate was 35.6 M/s offsets 136 200 216
Message rate was 37.0 M/s offsets 136 200 216
Message rate was 37.2 M/s offsets 136 200 216

Message rate was 37.1 M/s offsets 136 200 216
Message rate was 35.0 M/s offsets 136 200 216
Message rate was 37.1 M/s offsets 136 200 216

If I use any two logical threads on different cores, I get the inconsistent behaviour:

Message rate was 60.2 M/s offsets 136 200 216
Message rate was 68.7 M/s offsets 136 200 216
Message rate was 55.3 M/s offsets 136 200 216

Message rate was 39.2 M/s offsets 136 200 216
Message rate was 39.1 M/s offsets 136 200 216
Message rate was 37.5 M/s offsets 136 200 216

Message rate was 75.3 M/s offsets 136 200 216
Message rate was 73.8 M/s offsets 136 200 216
Message rate was 66.8 M/s offsets 136 200 216

EDIT: It appears that triggering a GC will shift the behaviour. These show repeated tests on the same buffers and counters with a manually triggered GC halfway through.

faster after GC

Message rate was 27.4 M/s offsets 136 200 216
Message rate was 27.8 M/s offsets 136 200 216
Message rate was 29.6 M/s offsets 136 200 216
Message rate was 27.7 M/s offsets 136 200 216
Message rate was 29.6 M/s offsets 136 200 216
[GC 14312K->1518K(244544K), 0.0003050 secs]
[Full GC 1518K->1328K(244544K), 0.0068270 secs]
Message rate was 34.7 M/s offsets 64 128 144
Message rate was 54.5 M/s offsets 64 128 144
Message rate was 54.1 M/s offsets 64 128 144
Message rate was 51.9 M/s offsets 64 128 144
Message rate was 57.2 M/s offsets 64 128 144

and slower

Message rate was 61.1 M/s offsets 136 200 216
Message rate was 61.8 M/s offsets 136 200 216
Message rate was 60.5 M/s offsets 136 200 216
Message rate was 61.1 M/s offsets 136 200 216
[GC 35740K->1440K(244544K), 0.0018170 secs]
[Full GC 1440K->1302K(244544K), 0.0071290 secs]
Message rate was 53.9 M/s offsets 64 128 144
Message rate was 54.3 M/s offsets 64 128 144
Message rate was 50.8 M/s offsets 64 128 144
Message rate was 56.6 M/s offsets 64 128 144
Message rate was 56.0 M/s offsets 64 128 144
Message rate was 53.6 M/s offsets 64 128 144

EDIT: Using @BegemoT's library to print the core ID used, I get the following on a 3.8 GHz i7 (home PC).

Note: the offsets are incorrect by a factor of 8. Because the heap size was small, the JVM doesn't multiply the reference by 8 as it does with a larger heap (but one still under 32 GB).

writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 54.4 M/s offsets 3392 3904 4416
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#6]
Message rate was 54.2 M/s offsets 3392 3904 4416
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 60.7 M/s offsets 3392 3904 4416

writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 25.5 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 25.9 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 26.0 M/s offsets 1088 1600 2112

writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 61.0 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 61.8 M/s offsets 1088 1600 2112
writer.currentCore() -> Core[#0]
reader.currentCore() -> Core[#5]
Message rate was 60.7 M/s offsets 1088 1600 2112

You can see that the same logical threads are being used, but the performance varies between runs, not within a run (within a run the same objects are used).


I have found the problem. It was a memory layout issue, but I couldn't see a simple way to resolve it: ByteBuffer cannot be extended, so you can't add padding to it, so I create a padded object which I then discard.

    final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64 * 1024);
    final ByteBuffer readBuffer = writeBuffer.slice();
    new PaddedAtomicInteger(); // discarded object, only here to shift the heap layout of the counters below
    final AtomicInteger readCount = new PaddedAtomicInteger();
    final AtomicInteger writeCount = new PaddedAtomicInteger();

Without this extra padding (i.e. without the unused object), the results look like this on a 3.8 GHz i7.

Message rate was 38.5 M/s offsets 3392 3904 4416
Message rate was 54.7 M/s offsets 3392 3904 4416
Message rate was 59.4 M/s offsets 3392 3904 4416

Message rate was 54.3 M/s offsets 1088 1600 2112
Message rate was 56.3 M/s offsets 1088 1600 2112
Message rate was 56.6 M/s offsets 1088 1600 2112

Message rate was 28.0 M/s offsets 1088 1600 2112
Message rate was 28.1 M/s offsets 1088 1600 2112
Message rate was 28.0 M/s offsets 1088 1600 2112

Message rate was 17.4 M/s offsets 1088 1600 2112
Message rate was 17.4 M/s offsets 1088 1600 2112
Message rate was 17.4 M/s offsets 1088 1600 2112

Message rate was 54.5 M/s offsets 1088 1600 2112
Message rate was 54.2 M/s offsets 1088 1600 2112
Message rate was 55.1 M/s offsets 1088 1600 2112

Message rate was 25.5 M/s offsets 1088 1600 2112
Message rate was 25.6 M/s offsets 1088 1600 2112
Message rate was 25.6 M/s offsets 1088 1600 2112

Message rate was 56.6 M/s offsets 1088 1600 2112
Message rate was 54.7 M/s offsets 1088 1600 2112
Message rate was 54.4 M/s offsets 1088 1600 2112

Message rate was 57.0 M/s offsets 1088 1600 2112
Message rate was 55.9 M/s offsets 1088 1600 2112
Message rate was 56.3 M/s offsets 1088 1600 2112

Message rate was 51.4 M/s offsets 1088 1600 2112
Message rate was 56.6 M/s offsets 1088 1600 2112
Message rate was 56.1 M/s offsets 1088 1600 2112

Message rate was 46.4 M/s offsets 1088 1600 2112
Message rate was 46.4 M/s offsets 1088 1600 2112
Message rate was 47.4 M/s offsets 1088 1600 2112

With the discarded padded object:

Message rate was 54.3 M/s offsets 3392 4416 4928
Message rate was 53.1 M/s offsets 3392 4416 4928
Message rate was 59.2 M/s offsets 3392 4416 4928

Message rate was 58.8 M/s offsets 1088 2112 2624
Message rate was 58.9 M/s offsets 1088 2112 2624
Message rate was 59.3 M/s offsets 1088 2112 2624

Message rate was 59.4 M/s offsets 1088 2112 2624
Message rate was 59.0 M/s offsets 1088 2112 2624
Message rate was 59.8 M/s offsets 1088 2112 2624

Message rate was 59.8 M/s offsets 1088 2112 2624
Message rate was 59.8 M/s offsets 1088 2112 2624
Message rate was 59.2 M/s offsets 1088 2112 2624

Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624

Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.9 M/s offsets 1088 2112 2624
Message rate was 60.6 M/s offsets 1088 2112 2624

Message rate was 59.6 M/s offsets 1088 2112 2624
Message rate was 60.3 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624

Message rate was 60.9 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624

Message rate was 60.7 M/s offsets 1088 2112 2624
Message rate was 61.6 M/s offsets 1088 2112 2624
Message rate was 60.8 M/s offsets 1088 2112 2624

Message rate was 60.3 M/s offsets 1088 2112 2624
Message rate was 60.7 M/s offsets 1088 2112 2624
Message rate was 58.3 M/s offsets 1088 2112 2624

Unfortunately there is always the risk that after a GC, the objects will not be laid out optimally. The only way to resolve this may be to add padding to the original class. :(

asked Nov 01 '11 by Peter Lawrey




1 Answer

I'm not an expert in the area of processor caches but I suspect your issue is essentially a cache issue or some other memory layout problem. Repeated allocation of the buffers and counters without cleaning up the old objects may be causing you to periodically get a very bad cache layout, which may lead to your inconsistent performance.

Using your code and making a few mods, I have been able to make the performance consistent (my test machine is an Intel Core2 Quad Q6600 at 2.4 GHz with Win7 x64 - not quite the same as yours, but hopefully close enough to give relevant results). I've done this in two different ways, both of which have roughly the same effect.

First, move creation of the buffers and counters outside of the doTest method so that they are created only once and then reused for each pass of the test (see the sketch below). Now you get one allocation, it sits nicely in the cache, and performance is consistent.
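
A minimal sketch of that first change, based on the code in the question (my approximation of the idea, not necessarily the exact modification tested): the buffers and counters become static fields of Main, so doTest() no longer allocates anything per pass.

// Sketch only: hoist the allocations out of doTest() so every pass
// reuses the same objects (fields added to the Main class from the question).
static final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64 * 1024);
static final ByteBuffer readBuffer = writeBuffer.slice();
static final AtomicInteger readCount = new PaddedAtomicInteger();
static final AtomicInteger writeCount = new PaddedAtomicInteger();

public static void doTest() {
    for (int i = 0; i < 3; i++)
        performTiming(writeBuffer, readBuffer, readCount, writeCount);
    System.out.println();
}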

Another way to get the same reuse, but with "different" buffers/counters, was to insert a GC after the performTiming loop:

for (int i = 0; i < 3; i++)
    performTiming(writeBuffer, readBuffer, readCount, writeCount);
System.out.println();
System.gc();

Here the result is more or less the same: the GC lets the buffers/counters be reclaimed, and the next allocation ends up reusing the same memory (at least on my test system), so you end up in cache with consistent performance (I also added printing of the actual addresses to verify reuse of the same locations). My guess is that without the clean-up leading to reuse, you eventually end up with a buffer allocated that doesn't fit into the cache, and your performance suffers as it is swapped in. I suspect that you could do some strange things with the order of allocation (for example, you can make the performance worse on my machine by moving the counter allocation in front of the buffers) or create some dead space around each run to "purge" the cache, if you didn't want to eliminate the buffers from a prior loop.

Finally, as I said, processor cache and the fun of memory layouts aren't my area of expertise so if the explanations are misleading or wrong - sorry about that.

answered Oct 17 '22 by philwb