Why is XOR much faster than OR?

Question

Here is a benchmark:

fn benchmark_or(repetitions: usize, mut increment: usize) -> usize {
    let mut batch = 0;
    for _ in 0..repetitions {
        increment |= 1;
        batch += increment | 1;
    }
    batch
}

fn benchmark_xor(repetitions: usize, mut increment: usize) -> usize {
    let mut batch = 0;
    for _ in 0..repetitions {
        increment ^= 1;
        batch += increment | 1;
    }
    batch
}

fn benchmark(c: &mut Criterion) {
    let increment = 1;
    let repetitions = 1000;

    c.bench_function("Increment Or", |b| {
        b.iter(|| black_box(benchmark_or(repetitions, increment)))
    });
    c.bench_function("Increment Xor", |b| {
        b.iter(|| black_box(benchmark_xor(repetitions, increment)))
    });
}

The results are:

Increment Or            time:   [271.02 ns 271.14 ns 271.28 ns]
Increment Xor           time:   [79.656 ns 79.761 ns 79.885 ns]

I get the same result if I replace or with and.

It's quite confusing as the or bench compiles to

.LBB0_5:
        or      edi, 1
        add     eax, edi
        add     rcx, -1
        jne     .LBB0_5

And the xor bench compiles to basically the same instructions plus two additional ones:

.LBB1_6:
        xor     edx, 1
        or      edi, 1
        add     eax, edi
        mov     edi, edx
        add     rcx, -1
        jne     .LBB1_6

Full Assembly code

Why is the difference so large?

harold · Accepted Answer

This part of the function that uses XOR which you quoted:

.LBB1_6:
        xor     rdx, 1
        or      rsi, 1
        add     rax, rsi
        mov     rsi, rdx
        add     rcx, -1
        jne     .LBB1_6

Is only the "tail end" of an unrolled loop. The "meat" (the part that actually runs a lot) is this:

.LBB1_9:
        add     rax, rdx
        add     rdi, 4
        jne     .LBB1_9

rdx is set up to be 4 times increment - in a way that I would describe as "only a compiler could be this stupid", but it only happens outside the loop so it's not a complete disaster. The loop counter is advanced by 4 in every iteration (starting negative and counting up to zero, which is clever, redeeming the compiler somewhat).

This loop could be executed at 1 iteration per cycle, translating to 4 iterations of the source-loop per cycle.

The loop in the function that uses OR is also unrolled, this is the actual "meat" of that function:

.LBB0_8:
        or      rsi, 1
        lea     rax, [rax + 2*rsi]
        lea     rax, [rax + 2*rsi]
        lea     rax, [rax + 2*rsi]
        lea     rax, [rax + 2*rsi]
        add     rdi, 8
        jne     .LBB0_8

It's unrolled by 8, which might have been nice, but chaining lea 4 times like that really takes "only a compiler could be this stupid" to the next level. The serial dependency through the leas costs at least 4 cycles per iteration, translating to 2 iterations of the source-loop per cycle.

That explains a 2x difference in performance (in favour of the XOR version), not quite your measured 3.4x difference, so further analysis could be done.

Why is XOR much faster than OR?

Tags:

performance

assembly

rust

Eugene Retunsky

1 Answers

harold

Recent Activity

Donate For Us

Why is XOR much faster than OR?

Tags:

performance

assembly

rust

Eugene Retunsky

1 Answers

harold

Related questions

Recent Activity

Donate For Us