While testing the read performance of a direct java.nio.ByteBuffer, I noticed that absolute reads are on average about 2x faster than relative reads. When I compare the source code of the relative and absolute reads, the code is pretty much the same, except that the relative read maintains an internal counter (the buffer position). Why do I see such a considerable difference in speed?
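To make the difference between the two access styles concrete, here is a minimal sketch. The class name PositionDemo and the helper positionsAfterReads are made up for illustration; only the ByteBuffer calls are real API.

```java
import java.nio.ByteBuffer;

public class PositionDemo {
    /** Returns {position after an absolute read, position after a relative read}. */
    static int[] positionsAfterReads() {
        ByteBuffer buf = ByteBuffer.allocateDirect(16);
        buf.putLong(0, 42L);            // absolute put: position stays at 0

        buf.getLong(0);                 // absolute get: position is not touched
        int afterAbsolute = buf.position();

        buf.getLong();                  // relative get: reads at position, then advances it by 8
        int afterRelative = buf.position();

        return new int[] { afterAbsolute, afterRelative };
    }

    public static void main(String[] args) {
        int[] p = positionsAfterReads();
        System.out.println("after absolute get: " + p[0]); // prints 0
        System.out.println("after relative get: " + p[1]); // prints 8
    }
}
```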
Below is the source code of my JMH benchmark:
import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class DirectByteBufferReadBenchmark {

    // One record is a long (8 bytes), an int (4 bytes) and a byte (1 byte).
    private static final int OBJ_SIZE = 8 + 4 + 1;
    private static final int NUM_ELEM = 10_000_000;

    @State(Scope.Benchmark)
    public static class Data {
        private ByteBuffer directByteBuffer;

        @Setup
        public void setup() {
            directByteBuffer = ByteBuffer.allocateDirect(OBJ_SIZE * NUM_ELEM);
            for (int i = 0; i < NUM_ELEM; i++) {
                directByteBuffer.putLong(i);
                directByteBuffer.putInt(i);
                directByteBuffer.put((byte) (i & 1));
            }
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    public long testReadAbsolute(Data d) throws InterruptedException {
        long val = 0L;
        for (int i = 0; i < NUM_ELEM; i++) {
            int index = OBJ_SIZE * i;
            val += d.directByteBuffer.getLong(index);
            d.directByteBuffer.getInt(index + 8);
            d.directByteBuffer.get(index + 12);
        }
        return val;
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    public long testReadRelative(Data d) throws InterruptedException {
        // Reset the position to 0 so every invocation reads from the start.
        d.directByteBuffer.rewind();
        long val = 0L;
        for (int i = 0; i < NUM_ELEM; i++) {
            val += d.directByteBuffer.getLong();
            d.directByteBuffer.getInt();
            d.directByteBuffer.get();
        }
        return val;
    }

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(DirectByteBufferReadBenchmark.class.getSimpleName())
                .warmupIterations(5)
                .measurementIterations(5)
                .forks(3)
                .threads(1)
                .build();
        new Runner(opt).run();
    }
}
And these are the results of my benchmark run:
Benchmark                                       Mode  Cnt   Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt  15  88.605 ± 9.276  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt  15  42.904 ± 3.018  ops/s
The test was run on a MacBook Pro (2.2 GHz Intel Core i7, 16 GB DDR3) with JDK 1.8.0_73.
UPDATE
I ran the same test with JDK 9-ea b134. Both tests show a ~10% speed increase, but the speed difference between the two remains similar.
# JMH 1.13 (released 45 days ago)
# VM version: JDK 9-ea, VM 9-ea+134
# VM invoker: /Library/Java/JavaVirtualMachines/jdk-9.jdk/Contents/Home/bin/java
# VM options: <none>
Benchmark                                       Mode  Cnt    Score    Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt  15  102.170 ± 10.199  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt  15   45.988 ±  3.896  ops/s
JDK 8 indeed generates worse code for the loop with relative ByteBuffer access.
JMH has a built-in perfasm profiler that prints the generated assembly code for the hottest regions. I've used it to compare the compiled testReadAbsolute vs. testReadRelative, and here are the main differences:
Relative getLong / getInt / get update the position field of the ByteBuffer. The VM does not optimize these updates away: there are 3 memory writes on each loop iteration.
The position range check is not eliminated: the conditional branches on each loop iteration remain in the compiled code.
Since the redundant field updates and range checks make the loop body longer, the VM unrolls only 2 iterations of the loop. The compiled version of the loop with absolute access has 16 iterations unrolled.
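The extra work on the relative path can be sketched with a toy model. This is not the JDK source: ToyBuffer and its fields are hypothetical names, and the real DirectByteBuffer reads native memory instead of delegating to another buffer. The sketch only shows where the position check and the position write sit relative to the read itself.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

/** Toy model of the two access paths; NOT the actual JDK implementation. */
public class ToyBuffer {
    private final ByteBuffer storage; // backing storage, always accessed by absolute index
    private final int limit;
    private int position;             // the field that relative reads must keep updating

    public ToyBuffer(int capacity) {
        this.storage = ByteBuffer.allocate(capacity);
        this.limit = capacity;
    }

    public void putLong(int index, long value) {
        storage.putLong(index, value);
    }

    /** Absolute read: one bounds check on the argument, no state change. */
    public long getLong(int index) {
        if (index < 0 || Long.BYTES > limit - index) {
            throw new IndexOutOfBoundsException();
        }
        return storage.getLong(index);
    }

    /** Relative read: a check against the position field, then a write to it. */
    public long getLong() {
        if (limit - position < Long.BYTES) {
            throw new BufferUnderflowException();
        }
        int p = position;
        position += Long.BYTES; // a memory write on every call unless the JIT can elide it
        return storage.getLong(p);
    }

    public int position() {
        return position;
    }

    public static void main(String[] args) {
        ToyBuffer buf = new ToyBuffer(16);
        buf.putLong(0, 7L);
        buf.putLong(8, 9L);
        System.out.println(buf.getLong(8) + " position=" + buf.position()); // 9 position=0
        System.out.println(buf.getLong() + " position=" + buf.position());  // 7 position=8
    }
}
```

In the absolute case the index is a plain loop variable, which is easy for the JIT to reason about. In the relative case the same information lives in an object field that is read, checked and written on each call.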
testReadAbsolute is compiled very well: the main loop just reads 16 longs, sums them up and jumps to the next iteration if index < 10_000_000 - 16. The state of directByteBuffer is not updated. However, the JVM is not that smart for testReadRelative: it seems that it cannot optimize field access of an object from outside.
There was much work in JDK 9 to optimize ByteBuffer. I've run the same test on JDK 9-ea b134, and verified that testReadRelative does not have redundant memory writes and range checks. Now it runs almost as fast as testReadAbsolute.
// JDK 1.8.0_92, VM 25.92-b14
Benchmark                                       Mode  Cnt    Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt  10   99.727 ± 0.542  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt  10   47.126 ± 0.289  ops/s
// JDK 9-ea, VM 9-ea+134
Benchmark                                       Mode  Cnt    Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt  10  109.369 ± 0.403  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt  10   97.140 ± 0.572  ops/s
UPDATE
In order to help the JIT compiler with optimization, I've introduced a local variable ByteBuffer directByteBuffer = d.directByteBuffer in both benchmarks. Otherwise the extra level of indirection does not allow the compiler to eliminate the ByteBuffer.position field updates.
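A sketch of that change, written as a plain Java program so it runs without JMH. The class LocalBufferSketch, the fill helper and the reduced element count are stand-ins of mine, not part of the original benchmark; the loops themselves mirror the benchmark methods with the local variable added.

```java
import java.nio.ByteBuffer;

public class LocalBufferSketch {
    static final int OBJ_SIZE = 8 + 4 + 1;

    /** Stand-in for the benchmark's @State class. */
    static class Data {
        ByteBuffer directByteBuffer;
    }

    /** Fills a direct buffer the same way the benchmark's setup() does. */
    static Data fill(int numElem) {
        Data d = new Data();
        d.directByteBuffer = ByteBuffer.allocateDirect(OBJ_SIZE * numElem);
        for (int i = 0; i < numElem; i++) {
            d.directByteBuffer.putLong(i);
            d.directByteBuffer.putInt(i);
            d.directByteBuffer.put((byte) (i & 1));
        }
        return d;
    }

    static long readRelative(Data d, int numElem) {
        // Read the field once, outside the loop, instead of on every access.
        ByteBuffer directByteBuffer = d.directByteBuffer;
        directByteBuffer.rewind();
        long val = 0L;
        for (int i = 0; i < numElem; i++) {
            val += directByteBuffer.getLong();
            directByteBuffer.getInt();
            directByteBuffer.get();
        }
        return val;
    }

    static long readAbsolute(Data d, int numElem) {
        ByteBuffer directByteBuffer = d.directByteBuffer;
        long val = 0L;
        for (int i = 0; i < numElem; i++) {
            int index = OBJ_SIZE * i;
            val += directByteBuffer.getLong(index);
            directByteBuffer.getInt(index + 8);
            directByteBuffer.get(index + 12);
        }
        return val;
    }

    public static void main(String[] args) {
        int numElem = 1_000;
        Data d = fill(numElem);
        // Both loops sum the longs 0..999, so both print 499500.
        System.out.println(readRelative(d, numElem));
        System.out.println(readAbsolute(d, numElem));
    }
}
```

This only checks that both loops read the same data. It says nothing about speed; the timings above come from the JMH runs.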