-Edit2- Seems like occasionally my CPU runs the unsorted version as fast as the sorted one. On other machines they're consistently the same speed. I guess I'm not benchmarking correctly, or there are subtle things going on behind the scenes.
-Edit- ASM code below. The generated code is the same in the main loop. Can someone confirm whether it's faster or the same? I'm using Clang 10 and I'm on a Ryzen 3600. According to Compiler Explorer there are no branches; the add uses a SIMD instruction that adds either A or B based on a mask. The mask comes from the >= 128 comparison, and B is a vector of zeros. So no, there is no hidden branch from what I can see.
I thought this very popular question was silly and tried it in Clang: Why is processing a sorted array faster than processing an unsorted array?
With g++ -O2 it takes 2 seconds; with Clang it took 0.32. I tried making it branchless as below and found that even though it's branchless, sorting still makes it faster. What's going on?! (It also seems like -O2 vectorizes the code, so I didn't try to do more.)
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <stdio.h>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            bool v = data[c] >= 128;
            sum += data[c] * v;
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    printf("%f\nsum = %lld\n", elapsedTime, sum);
}
.LBB0_4: # Parent Loop BB0_3 Depth=1
# => This Inner Loop Header: Depth=2
vmovdqa (%rsp,%rcx,4), %xmm5
vmovdqa 16(%rsp,%rcx,4), %xmm6
vmovdqa 32(%rsp,%rcx,4), %xmm7
vmovdqa 48(%rsp,%rcx,4), %xmm0
addq $16, %rcx
vpcmpgtd %xmm8, %xmm5, %xmm9
vpcmpgtd %xmm8, %xmm6, %xmm10
vpcmpgtd %xmm8, %xmm7, %xmm11
vpcmpgtd %xmm8, %xmm0, %xmm12
vpand %xmm5, %xmm9, %xmm5
vpand %xmm0, %xmm12, %xmm0
vpand %xmm6, %xmm10, %xmm6
vpand %xmm7, %xmm11, %xmm7
vpmovsxdq %xmm6, %ymm9
vpmovsxdq %xmm5, %ymm5
vpmovsxdq %xmm7, %ymm6
vpmovsxdq %xmm0, %ymm0
vpaddq %ymm5, %ymm1, %ymm1
vpaddq %ymm2, %ymm9, %ymm2
vpaddq %ymm6, %ymm3, %ymm3
vpaddq %ymm0, %ymm4, %ymm4
cmpq $32768, %rcx # imm = 0x8000
jne .LBB0_4
In general, processing a sorted array is faster than processing an unsorted one because of branch prediction: the branch predictor guesses whether a conditional branch (jump) in the instruction flow is likely to be taken. In the branchless version above, however, the inner loop compiles to straight-line SIMD code, so branch prediction shouldn't play a significant role here.
So, the branchless version is almost twice as fast as the branching version on my system (3.4 GHz).
It's almost certainly branch prediction. The hardware will try to figure out how your conditions will evaluate and get that branch of code ready, and one method it leverages is previous iterations through loops. Since your sorted array means it's going to see a long run of false paths, then a long run of true paths, the only times it's going to miss are when the loop starts and when `data[c]` first becomes `>= 128`.
In addition, it's possible the compiler is smart enough to optimize on a sorted array. I haven't tested, but it could save the magic `c` value, do a lookup for it once, and then run fewer iterations of your inner `for` loop. This would be an optimization you could write yourself as well.