Why is XOR much faster than OR?

Here is a benchmark:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn benchmark_or(repetitions: usize, mut increment: usize) -> usize {
    let mut batch = 0;
    for _ in 0..repetitions {
        increment |= 1;
        batch += increment | 1;
    }
    batch
}

fn benchmark_xor(repetitions: usize, mut increment: usize) -> usize {
    let mut batch = 0;
    for _ in 0..repetitions {
        increment ^= 1;
        batch += increment | 1;
    }
    batch
}

fn benchmark(c: &mut Criterion) {
    let increment = 1;
    let repetitions = 1000;

    c.bench_function("Increment Or", |b| {
        b.iter(|| black_box(benchmark_or(repetitions, increment)))
    });
    c.bench_function("Increment Xor", |b| {
        b.iter(|| black_box(benchmark_xor(repetitions, increment)))
    });
}

criterion_group!(benches, benchmark);
criterion_main!(benches);

The results are:

Increment Or            time:   [271.02 ns 271.14 ns 271.28 ns]
Increment Xor           time:   [79.656 ns 79.761 ns 79.885 ns] 

I get the same result if I replace the OR with an AND.

It's quite confusing, because the OR benchmark compiles to:

.LBB0_5:
        or      edi, 1
        add     eax, edi
        add     rcx, -1
        jne     .LBB0_5

And the XOR benchmark compiles to basically the same instructions, plus two additional ones:

.LBB1_6:
        xor     edx, 1
        or      edi, 1
        add     eax, edi
        mov     edi, edx
        add     rcx, -1
        jne     .LBB1_6

Full Assembly code

Why is the difference so large?

asked Nov 25 '20 by Eugene Retunsky

1 Answer

This part of the function that uses XOR, which you quoted:

.LBB1_6:
        xor     rdx, 1
        or      rsi, 1
        add     rax, rsi
        mov     rsi, rdx
        add     rcx, -1
        jne     .LBB1_6

Is only the "tail end" of an unrolled loop. The "meat" (the part that actually runs a lot) is this:

.LBB1_9:
        add     rax, rdx
        add     rdi, 4
        jne     .LBB1_9

rdx is set up to be 4 times increment, in a way that I would describe as "only a compiler could be this stupid", but it happens only once outside the loop, so it's not a complete disaster. The loop counter is advanced by 4 in every iteration (starting negative and counting up to zero, which is clever and redeems the compiler somewhat).
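
In Rust terms, that hot loop behaves roughly like the sketch below. The names are mine (not anything the compiler emits), and it assumes repetitions is a multiple of 4 so that the quoted "tail end" never runs:

// Hypothetical sketch of the unrolled XOR hot loop shown above.
fn xor_hot_loop_sketch(repetitions: usize, per_source_iter: usize) -> usize {
    let mut batch: usize = 0;
    // rdx: four source iterations' worth of work, computed once outside the loop
    let per_unrolled_iter = 4 * per_source_iter;
    // rdi: starts negative and counts up towards zero, 4 source iterations at a time
    let mut counter = -(repetitions as isize);
    while counter != 0 {
        batch += per_unrolled_iter; // add rax, rdx
        counter += 4;               // add rdi, 4
    }
    batch
}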

This loop could be executed at 1 iteration per cycle, translating to 4 iterations of the source-loop per cycle.

The loop in the function that uses OR is also unrolled; this is the actual "meat" of that function:

.LBB0_8:
        or      rsi, 1
        lea     rax, [rax + 2*rsi]
        lea     rax, [rax + 2*rsi]
        lea     rax, [rax + 2*rsi]
        lea     rax, [rax + 2*rsi]
        add     rdi, 8
        jne     .LBB0_8

It's unrolled by 8, which might have been nice, but chaining lea 4 times like that really takes "only a compiler could be this stupid" to the next level. The serial dependency through the leas costs at least 4 cycles per iteration, translating to 2 iterations of the source-loop per cycle.
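
A rough Rust rendering of that hot loop is below (again with my own made-up names, and assuming repetitions is a multiple of 8 so only the unrolled body runs). The point is that the four additions into batch form a serial dependency chain:

// Hypothetical sketch of the unrolled OR hot loop shown above.
fn or_hot_loop_sketch(repetitions: usize, increment: usize) -> usize {
    let mut batch: usize = 0;
    let mut counter = -(repetitions as isize);
    while counter != 0 {
        let value = increment | 1; // or rsi, 1 (redundant after the first pass)
        batch += 2 * value;        // lea rax, [rax + 2*rsi]
        batch += 2 * value;        // lea rax, [rax + 2*rsi]  <- waits on the line above
        batch += 2 * value;        // lea rax, [rax + 2*rsi]  <- and so on
        batch += 2 * value;        // lea rax, [rax + 2*rsi]
        counter += 8;              // add rdi, 8
    }
    batch
}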

That explains a 2x difference in performance (in favour of the XOR version), not quite your measured 3.4x difference, so further analysis could be done.
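
As a back-of-envelope check (the implied clock frequency here is my own inference from the posted numbers, not anything that was measured directly):

fn main() {
    let iterations = 1000.0_f64;
    let xor_cycles = iterations / 4.0; // 4 source iterations per cycle -> ~250 cycles
    let or_cycles = iterations / 2.0;  // 2 source iterations per cycle -> ~500 cycles
    // The measured ~79.7 ns for the XOR version implies roughly 250 / 79.7 ≈ 3.1 GHz.
    let implied_ghz = xor_cycles / 79.7;
    // At that clock the OR version "should" take about 160 ns, yet it measures ~271 ns,
    // which is the remaining gap that would need further analysis.
    let predicted_or_ns = or_cycles / implied_ghz;
    println!("implied clock ~{:.1} GHz, predicted OR time ~{:.0} ns", implied_ghz, predicted_or_ns);
}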

answered Nov 14 '22 by harold