I wrote a simple benchmark in order to find out if bounds check can be eliminated when the array gets computed via bitwise and. This is basically what nearly all hash tables do: They compute
h & (table.length - 1)
as an index into the table
, where h
is the hashCode
or a derived value. The results shows that the bounds check don't get eliminated.
The idea of my benchmark is pretty simple: Compute two values i
and j
, where both are guaranteed to be valid array indexes.
i
is the loop counter. When it gets used as array index, the bounds check gets eliminated.j
gets computed as x & (table.length - 1)
, where x
is some value changing on each iteration. When it gets used as array index, the bounds check does not get eliminated.The relevant part is as follows:
for (int i=0; i<=table.length-1; ++i) {
x += result;
final int j = x & (table.length-1);
result ^= i + table[j];
}
The other experiment uses
result ^= table[i] + j;
instead. The difference in timing is maybe 15% (pretty consistently across different variants I've tried). My questions:
j
?MarkoTopolnik's answer shows that it's all more complicated and the elimination of the bounds checks is not guaranteed to be a win, especially on his computer the "normal" code is slower than "masked". I guess this is because of it allowing some additional optimization which shows to be actually detrimental in this case (given the complexity of the current CPUs, the compiler hardly even knows for sure).
leventov's answer shows clearly that the array bounds check gets done in "masked" and that it's elimination makes the code as fast as "normal".
Donal Fellows points to the fact, that the masking doesn't work for a zero-length table, as x & (0-1)
equals to x
. So the best thing the compiler can do is to replace the bound check by a zero-length check. But this is IMHO still worth it, as the zero-length check can be moved out of the loop easily.
Because of the the equivalence a[x & (a.length - 1)]
throws if and only if a.length == 0
, the compiler can do the following:
Such an optimization should be pretty simple and cheap as it only looks at the parent nodes in the SSA graph. Unlike many complex optimizations, it can never be detrimental, as it only replaces one check by a slightly simpler one; so there's no problem, not even if it can't be moved out of the loop.
I'll post this to the hotspot-dev mailing lists.
John Rose filed an RFE and there's already a "quick-and-dirty" patch.
Bounds checking is any method that is used to check whether a given variable is within acceptable bounds before it is used. This includes ensuring that a number is within a valid range (range checking), as well as checking whether an array index is within bounds for a given array (index checking).
In computer programming, bounds checking is any method of detecting whether a variable is within some bounds before it is used. It is usually used to ensure that a number fits into a given type (range checking), or that a variable being used as an array index is within the bounds of the array (index checking).
Array bound checking refers to determining whether all array references in a program are within their declared ranges. This checking is critical for software verification and validation because subscripting arrays beyond their declared sizes may produce unexpected results, security holes, or failures.
Since the array is constructed as the program is running, the compiler does not know its length and can't detect some errors. As a Java program is running, each time an array index is used it is checked to be sure that it is OK. This is called bounds checking, and is extremely important for catching errors.
To start off, the main difference between your two tests is definitely in bounds check elimination; however, the way this influences the machine code is far from what the naïve expectation would suggest.
The bounds check figures more strongly as a loop exit point than as additional code which introduces overhead.
The loop exit point prevents the following optimization which I have culled from the emitted machine code:
If the loop can break out at any step, this staging would result in work performed for loop steps which were never actually taken.
Consider this slight modification of your code:
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(Measure.N)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(1)
public class Measure {
public static final int N = 1024;
private final int[] table = new int[N];
@Setup public void setUp() {
final Random random = new Random();
for (int i = 0; i < table.length; ++i) {
final int x = random.nextInt();
table[i] = x == 0? 1 : x;
}
}
@GenerateMicroBenchmark public int normalIndex() {
int result = 0;
final int[] table = this.table;
int x = 0;
for (int i = 0; i <= table.length - 1; ++i) {
x += i;
final int j = x & (table.length - 1);
final int entry = table[i];
result ^= entry + j;
if (entry == 0) break;
}
return result;
}
@GenerateMicroBenchmark public int maskedIndex() {
int result = 0;
final int[] table = this.table;
int x = 0;
for (int i = 0; i <= table.length - 1; ++i) {
x += i;
final int j = x & (table.length - 1);
final int entry = table[j];
result ^= i + entry;
if (entry == 0) break;
}
return result;
}
}
There is just one difference: I have added the check
if (entry == 0) break;
to give the loop a way to exit prematurely on any step. (I also introduced a guard to ensure no array entries are actually 0.)
On my machine, this is the result:
Benchmark Mode Samples Mean Mean error Units
o.s.Measure.maskedIndex avgt 5 1.378 0.229 ns/op
o.s.Measure.normalIndex avgt 5 0.924 0.092 ns/op
the "normal index" variant is substantially faster, as generally expected.
However, let us remove the additional check:
// if (entry == 0) break;
Now my results are these:
Benchmark Mode Samples Mean Mean error Units
o.s.Measure.maskedIndex avgt 5 1.130 0.065 ns/op
o.s.Measure.normalIndex avgt 5 1.229 0.053 ns/op
"Masked index" responded predictably (reduced overhead), but "normal index" is suddenly much worse. This is apparently due to a bad fit between the additional optimization step and my specific CPU model.
The performance model at such a detailed level is very unstable and, as witnessed on my CPU, even erratic.
I've extended a benchmark by Marko Topolnik:
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(BCElimination.N)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(2)
public class BCElimination {
public static final int N = 1024;
private static final Unsafe U;
private static final long INT_BASE;
private static final long INT_SCALE;
static {
try {
Field f = Unsafe.class.getDeclaredField("theUnsafe");
f.setAccessible(true);
U = (Unsafe) f.get(null);
} catch (Exception e) {
throw new IllegalStateException(e);
}
INT_BASE = U.arrayBaseOffset(int[].class);
INT_SCALE = U.arrayIndexScale(int[].class);
}
private final int[] table = new int[BCElimination.N];
@Setup public void setUp() {
final Random random = new Random();
for (int i=0; i<table.length; ++i) table[i] = random.nextInt();
}
@GenerateMicroBenchmark public int normalIndex() {
int result = 0;
final int[] table = this.table;
int x = 0;
for (int i=0; i<=table.length-1; ++i) {
x += i;
final int j = x & (table.length-1);
result ^= table[i] + j;
}
return result;
}
@GenerateMicroBenchmark public int maskedIndex() {
int result = 0;
final int[] table = this.table;
int x = 0;
for (int i=0; i<=table.length-1; ++i) {
x += i;
final int j = x & (table.length-1);
result ^= i + table[j];
}
return result;
}
@GenerateMicroBenchmark public int maskedIndexUnsafe() {
int result = 0;
final int[] table = this.table;
long x = 0;
for (int i=0; i<=table.length-1; ++i) {
x += i * INT_SCALE;
final long j = x & ((table.length-1) * INT_SCALE);
result ^= i + U.getInt(table, INT_BASE + j);
}
return result;
}
}
Results:
Benchmark Mean Mean error Units
BCElimination.maskedIndex 1,235 0,004 ns/op
BCElimination.maskedIndexUnsafe 1,092 0,007 ns/op
BCElimination.normalIndex 1,071 0,008 ns/op
2. The second question is for hotspot-dev mailing lists rather than StackOverflow, IMHO.
In order to safely eliminate that bounds check, it is necessary to prove that
h & (table.length - 1)
is guaranteed to produce a valid index into table
. It won't if table.length
is zero (as you'll end up with & -1
, an effective-noop). It also won't usefully do it if table.length
is not a power of 2 (you'll lose information; consider the case where table.length
is 17).
How can the HotSpot compiler know that these bad conditions are not true? It has to be more conservative than a programmer can be, as the programmer can know more about the high-level constraints on the system (e.g., that the array is never empty and always as a number of elements that is a power-of-two).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With