Here is a benchmark:
fn benchmark_or(repetitions: usize, mut increment: usize) -> usize {
let mut batch = 0;
for _ in 0..repetitions {
increment |= 1;
batch += increment | 1;
}
batch
}
fn benchmark_xor(repetitions: usize, mut increment: usize) -> usize {
let mut batch = 0;
for _ in 0..repetitions {
increment ^= 1;
batch += increment | 1;
}
batch
}
fn benchmark(c: &mut Criterion) {
let increment = 1;
let repetitions = 1000;
c.bench_function("Increment Or", |b| {
b.iter(|| black_box(benchmark_or(repetitions, increment)))
});
c.bench_function("Increment Xor", |b| {
b.iter(|| black_box(benchmark_xor(repetitions, increment)))
});
}
The results are:
Increment Or time: [271.02 ns 271.14 ns 271.28 ns]
Increment Xor time: [79.656 ns 79.761 ns 79.885 ns]
I get the same result if I replace or
with and
.
It's quite confusing as the or
bench compiles to
.LBB0_5:
or edi, 1
add eax, edi
add rcx, -1
jne .LBB0_5
And the xor
bench compiles to basically the same instructions plus two additional ones:
.LBB1_6:
xor edx, 1
or edi, 1
add eax, edi
mov edi, edx
add rcx, -1
jne .LBB1_6
Full Assembly code
Why is the difference so large?
This part of the function that uses XOR which you quoted:
.LBB1_6:
xor rdx, 1
or rsi, 1
add rax, rsi
mov rsi, rdx
add rcx, -1
jne .LBB1_6
Is only the "tail end" of an unrolled loop. The "meat" (the part that actually runs a lot) is this:
.LBB1_9:
add rax, rdx
add rdi, 4
jne .LBB1_9
rdx
is set up to be 4 times increment
- in a way that I would describe as "only a compiler could be this stupid", but it only happens outside the loop so it's not a complete disaster. The loop counter is advanced by 4 in every iteration (starting negative and counting up to zero, which is clever, redeeming the compiler somewhat).
This loop could be executed at 1 iteration per cycle, translating to 4 iterations of the source-loop per cycle.
The loop in the function that uses OR is also unrolled, this is the actual "meat" of that function:
.LBB0_8:
or rsi, 1
lea rax, [rax + 2*rsi]
lea rax, [rax + 2*rsi]
lea rax, [rax + 2*rsi]
lea rax, [rax + 2*rsi]
add rdi, 8
jne .LBB0_8
It's unrolled by 8, which might have been nice, but chaining lea
4 times like that really takes "only a compiler could be this stupid" to the next level. The serial dependency through the lea
s costs at least 4 cycles per iteration, translating to 2 iterations of the source-loop per cycle.
That explains a 2x difference in performance (in favour of the XOR version), not quite your measured 3.4x difference, so further analysis could be done.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With