Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?

Tags:

c++

performance

cpu-architecture

branch-prediction

clang

I discovered this popular ~9-year-old SO question and decided to double-check its outcomes.

I have an AMD Ryzen 9 5950X, clang++ 10, and Linux. I copy-pasted the code from the question, and here is what I got:

Sorted - 0.549702s:

~/d/so_sorting_faster$ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
    std::sort(data, data + arraySize);
0.549702
sum = 314931600000

Unsorted - 0.546554s:

~/d/so_sorting_faster $ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
    // std::sort(data, data + arraySize);
0.546554
sum = 314931600000

I am pretty sure that the unsorted version coming out 3 ms faster is just noise, but it is clearly no longer slower.

So, what has changed in CPU architecture so that the unsorted case is no longer an order of magnitude slower?

Here are results from multiple runs:

Unsorted: 0.543557 0.551147 0.541722 0.555599
Sorted:   0.542587 0.559719 0.53938  0.557909

Just in case, here is my main.cpp:

#include <algorithm>
#include <cstdlib>  // std::rand
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    // std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
    return 0;
}

Update

With a larger number of elements (627680):

Unsorted:
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
    // std::sort(data, data + arraySize);
10.3814

Sorted:
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
    std::sort(data, data + arraySize);
10.6885

I think the question is still relevant: there is almost no difference.

666

asked Oct 09 '22 14:10

DimanNe

1 Answer

Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That's what your updated compiler is doing.

Specifically, clang++ 10 with -O3 vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is somewhat complicated, but one thing you definitely don't see is any conditional branch on the data[c] >= 128 test. Instead, it uses vector compare instructions (pcmpgtd) whose output is a mask with all bits set in the lanes of matching elements and all bits clear in the lanes of non-matching ones. The subsequent pand with this mask replaces the non-matching elements with 0, so that they contribute nothing when unconditionally added to the sum.

The rough C++ equivalent would be

sum += data[c] & -(data[c] >= 128);

The code actually keeps two running 64-bit sums, for the even and odd elements of the array, so that they can be accumulated in parallel and then added together at the end of the loop.

Some of the extra complexity is to take care of sign-extending the 32-bit data elements to 64 bits; that's what sequences like pxor xmm5, xmm5 ; pcmpgtd xmm5, xmm4 ; punpckldq xmm4, xmm5 accomplish. Turn on -mavx2 and you'll see a simpler vpmovsxdq ymm5, xmm5 in its place.

The code also looks long because the loop has been unrolled, processing 8 elements of data per iteration.

166

answered Oct 12 '22 02:10

Nate Eldredge
