Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does sorting make this branchless code faster?

Tags:

c++

-Edit2- Seems like occasionally my CPU runs unsorted as fast as sorted. On other machines they're consistently the same speed. I guess I'm not benchmarking correctly or there's subtle things going on behind the scene

-Edit- ASM code below. The generated code is the same in the main loop. Can someone confirm if it's faster or the same? I'm using clang 10 and I'm on ryzen 3600. According to compiler explorer there is no branches, the adding uses a SIMD instruction that adds either A or B based on the mask. The mask is from the >= 128 and B is a vector of 0's. So no there is no hidden branch from what I can see.

I thought this very popular question was silly and tried it in clang Why is processing a sorted array faster than processing an unsorted array?

With g++ -O2 it takes 2seconds with clang it took 0.32. I tried making it branchless like the below and found that even though it's branchless sorting still makes it faster. Whats going on?! (also it seems like O2 vectorize the code so I didn't try to do more)

#include <algorithm>
#include <ctime>
#include <stdio.h>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            bool v = data[c] >= 128;
            sum += data[c] * v;
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    printf("%f\nsum = %lld\n", elapsedTime, sum);
}

.LBB0_4:                                #   Parent Loop BB0_3 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
    vmovdqa (%rsp,%rcx,4), %xmm5
    vmovdqa 16(%rsp,%rcx,4), %xmm6
    vmovdqa 32(%rsp,%rcx,4), %xmm7
    vmovdqa 48(%rsp,%rcx,4), %xmm0
    addq    $16, %rcx
    vpcmpgtd    %xmm8, %xmm5, %xmm9
    vpcmpgtd    %xmm8, %xmm6, %xmm10
    vpcmpgtd    %xmm8, %xmm7, %xmm11
    vpcmpgtd    %xmm8, %xmm0, %xmm12
    vpand   %xmm5, %xmm9, %xmm5
    vpand   %xmm0, %xmm12, %xmm0
    vpand   %xmm6, %xmm10, %xmm6
    vpand   %xmm7, %xmm11, %xmm7
    vpmovsxdq   %xmm6, %ymm9
    vpmovsxdq   %xmm5, %ymm5
    vpmovsxdq   %xmm7, %ymm6
    vpmovsxdq   %xmm0, %ymm0
    vpaddq  %ymm5, %ymm1, %ymm1
    vpaddq  %ymm2, %ymm9, %ymm2
    vpaddq  %ymm6, %ymm3, %ymm3
    vpaddq  %ymm0, %ymm4, %ymm4
    cmpq    $32768, %rcx            # imm = 0x8000
    jne .LBB0_4
like image 863
Eric Stotch Avatar asked Apr 25 '20 00:04

Eric Stotch


People also ask

Why is sorted array faster?

In C++, it is faster to process a sorted array than an unsorted array because of branch prediction. In computer architecture, a branch prediction determines whether a conditional branch (jump) in the instruction flow of a program is likely to be taken or not. Branch prediction doesn't play a significant role here.

How much faster is branchless programming?

So, the branchless version is almost twice as fast as the branching version on my system (3.4 GHz.


Video Answer


1 Answers

It's almost certainly branch prediction. The hardware will try to figure out how your conditions will evaluate and get that branch of code ready, and one method it leverages is previous iterations through loops. Since your sorted array means it's going to have a bunch of false paths, then a bunch of true paths, this means the only time it's going to miss is when the loop starts and when data[c] first becomes >= 128.

In addition, it's possible the compiler is smart enough to optimize on a sorted array. I haven't tested, but it could save the magic c value, do a lookup on that once, and then have fewer iterations of your inner for. This would be an optimization you could write yourself as well.

like image 59
Stephen Newell Avatar answered Oct 22 '22 15:10

Stephen Newell