-Edit2- Seems like occasionally my CPU runs the unsorted version as fast as the sorted one. On other machines they're consistently the same speed. I guess I'm not benchmarking correctly, or there are subtle things going on behind the scenes.
-Edit- ASM code below. The generated code is the same in the main loop. Can someone confirm whether it's faster or the same? I'm using Clang 10 and I'm on a Ryzen 3600. According to Compiler Explorer there are no branches; the add uses a SIMD instruction that adds either A or B based on a mask. The mask comes from the >= 128 comparison, and B is a vector of zeros. So no, there is no hidden branch from what I can see.
I thought this very popular question was silly and tried it in Clang: Why is processing a sorted array faster than processing an unsorted array?
With g++ -O2 it takes 2 seconds; with Clang it took 0.32. I tried making it branchless as below and found that even though it's branchless, sorting still makes it faster. What's going on?! (It also seems like -O2 vectorizes the code, so I didn't try to do more.)
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <stdio.h>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            bool v = data[c] >= 128;
            sum += data[c] * v;
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    printf("%f\nsum = %lld\n", elapsedTime, sum);
}
.LBB0_4: # Parent Loop BB0_3 Depth=1
# => This Inner Loop Header: Depth=2
vmovdqa (%rsp,%rcx,4), %xmm5
vmovdqa 16(%rsp,%rcx,4), %xmm6
vmovdqa 32(%rsp,%rcx,4), %xmm7
vmovdqa 48(%rsp,%rcx,4), %xmm0
addq $16, %rcx
vpcmpgtd %xmm8, %xmm5, %xmm9
vpcmpgtd %xmm8, %xmm6, %xmm10
vpcmpgtd %xmm8, %xmm7, %xmm11
vpcmpgtd %xmm8, %xmm0, %xmm12
vpand %xmm5, %xmm9, %xmm5
vpand %xmm0, %xmm12, %xmm0
vpand %xmm6, %xmm10, %xmm6
vpand %xmm7, %xmm11, %xmm7
vpmovsxdq %xmm6, %ymm9
vpmovsxdq %xmm5, %ymm5
vpmovsxdq %xmm7, %ymm6
vpmovsxdq %xmm0, %ymm0
vpaddq %ymm5, %ymm1, %ymm1
vpaddq %ymm2, %ymm9, %ymm2
vpaddq %ymm6, %ymm3, %ymm3
vpaddq %ymm0, %ymm4, %ymm4
cmpq $32768, %rcx # imm = 0x8000
jne .LBB0_4
In general, processing a sorted array is faster than processing an unsorted one because of branch prediction: the branch predictor guesses whether a conditional branch (jump) in the instruction flow is likely to be taken. In the branchless version above, however, the inner loop compiles to straight-line SIMD code, so branch prediction shouldn't play a significant role here.
So, the branchless version is almost twice as fast as the branching version on my system (3.4 GHz).
It's almost certainly branch prediction. The hardware will try to figure out how your conditions will evaluate and get that branch of code ready, and one method it leverages is previous iterations through loops. Since your sorted array means it's going to see a long run of false paths, then a long run of true paths, the only times it's going to miss are when the loop starts and when `data[c]` first becomes `>= 128`.
In addition, it's possible the compiler is smart enough to optimize on a sorted array. I haven't tested, but it could save the magic `c` value, do a lookup for it once, and then run fewer iterations of your inner `for` loop. This would be an optimization you could write yourself as well.