I discovered this popular ~9-year-old SO question and decided to double-check its outcomes.
So, I have AMD Ryzen 9 5950X, clang++ 10 and Linux, I copy-pasted code from the question and here is what I got:
Sorted - 0.549702s:
~/d/so_sorting_faster$ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
0.549702
sum = 314931600000
Unsorted - 0.546554s:
~/d/so_sorting_faster $ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
0.546554
sum = 314931600000
I am pretty sure that the fact that unsorted version turned out to be faster by 3ms is just noise, but it seems it is not slower anymore.
So, what has changed in the architecture of CPU (so that it is not an order of magnitude slower anymore)?
Here are results from multiple runs:
Unsorted: 0.543557 0.551147 0.541722 0.555599
Sorted: 0.542587 0.559719 0.53938 0.557909
Just in case, here is my main.cpp:
#include <algorithm>
#include <ctime>
#include <iostream>
int main()
{
// Generate data
const unsigned arraySize = 32768;
int data[arraySize];
for (unsigned c = 0; c < arraySize; ++c)
data[c] = std::rand() % 256;
// !!! With this, the next loop runs faster.
// std::sort(data, data + arraySize);
// Test
clock_t start = clock();
long long sum = 0;
for (unsigned i = 0; i < 100000; ++i)
{
// Primary loop
for (unsigned c = 0; c < arraySize; ++c)
{
if (data[c] >= 128)
sum += data[c];
}
}
double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
std::cout << elapsedTime << std::endl;
std::cout << "sum = " << sum << std::endl;
return 0;
}
Update
With larger number of elements (627680):
Unsorted
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
10.3814
Sorted:
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
10.6885
I think the question is still relevant - almost no difference.
In C++, it is faster to process a sorted array than an unsorted array because of branch prediction. In computer architecture, a branch prediction determines whether a conditional branch (jump) in the instruction flow of a program is likely to be taken or not. Branch prediction doesn't play a significant role here.
In short, searching in an unsorted array takes O(n) time: you potentially have to look at every item to find out if what you're looking for is there. A sorted array lets you speed up the search. Instead of having to examine every item, you only have to examine at most log2(n) items.
The major advantage of an ordered array is that the search times have time complexity of O(log n), compared to that of an unordered array, which is O (n).
2. Binary Search. Sequential search is the best that we can do when trying to find a value in an unsorted array.
Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That's what your updated compiler is doing.
Specifically, clang++ 10 with -O3
vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is a little bit complicated, but one thing you definitely don't see is any conditional branch on the data[c] >= 128
test. Instead it uses vector compare instructions (pcmpgtd
) whose output is a mask with 1s for matching elements and 0s for non-matching. The subsequent pand
with this mask replaces the non-matching elements by 0, so that they do not contribute anything when unconditionally added to the sum.
The rough C++ equivalent would be
sum += data[c] & -(data[c] >= 128);
The code actually keeps two running 64-bit sum
s, for the even and odd elements of the array, so that they can be accumulated in parallel and then added together at the end of the loop.
Some of the extra complexity is to take care of sign-extending the 32-bit data
elements to 64 bits; that's what sequences like pxor xmm5, xmm5 ; pcmpgtd xmm5, xmm4 ; punpckldq xmm4, xmm5
accomplish. Turn on -mavx2
and you'll see a simpler vpmovsxdq ymm5, xmm5
in its place.
The code also looks long because the loop has been unrolled, processing 8 elements of data
per iteration.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With