While benchmarking some options for my code (using 128-bit integers or not), I observed a behavior I just can't understand. Could anybody shed some light on this?
#include <stdio.h>
#include <stdint.h>
#include <time.h>

int main(int a, char** b)
{
    printf("Running tests\n");

    clock_t start = clock();
    unsigned __int128 t = 13;
    for(unsigned long i = 0; i < (1UL<<30); i++)
        t += 23442*t + 25;
    if(t == 0) printf("0\n");
    printf("u128, +25, took %fs\n", double(clock() - start)/CLOCKS_PER_SEC);

    start = clock();
    t = 13;
    for(unsigned long i = 0; i < (1UL<<30); i++)
        t += 23442*t;
    if(t == 0) printf("0\n");
    printf("u128, no+, took %fs\n", double(clock() - start)/CLOCKS_PER_SEC);

    start = clock();
    unsigned long u = 13;
    for(unsigned long i = 0; i < (1UL<<30); i++)
        u += 23442*u + 25;
    if(u == 0) printf("0\n");
    printf("u64 , +25, took %fs\n", double(clock() - start)/CLOCKS_PER_SEC);

    start = clock();
    u = 13;
    for(unsigned long i = 0; i < (1UL<<30); i++)
        u += 23442*u;
    if(u == 0) printf("0\n");
    printf("u64 , no+, took %fs\n", double(clock() - start)/CLOCKS_PER_SEC);

    return 0;
}
(Note that the printf calls are there so that gcc does not optimize away the for loops.) On my system, this reliably produces the following output:
u128, +25, took 2.411922s
u128, no+, took 1.799805s
u64 , +25, took 1.797960s
u64 , no+, took 2.454104s
While the 128-bit integer behavior makes sense, I fail to see how the 64-bit loop with fewer operations performs significantly (~30%) slower.
Is that a known behavior? What would be the general rule when trying to benefit from this optimization when writing loops of this kind?
Edit: the behavior is only observed when compiling with the -O3 option.
gcc -lstdc++ -O3 -o a main.cpp
u128, +25, took 2.413949s
u128, no+, took 1.799469s
u64 , +25, took 1.798278s
u64 , no+, took 2.453414s
gcc -lstdc++ -O2 -o a main.cpp
u128, +25, took 2.415244s
u128, no+, took 1.800499s
u64 , +25, took 1.798699s
u64 , no+, took 1.348133s
The loop is so tight that dependency stalls, ALU contention, etc. come into play and dominate the timing. The result is therefore not reliable, and is more sensitive to such factors than to the actual instruction count.
Note that the +25 can be calculated in parallel with the multiply.
PS. My result on 4970K:
gcc version 5.2.1 20151010
gcc -lstdc++ -O2 -o a a.cpp
u128, +25, took 1.346360s
u128, no+, took 1.022965s
u64 , +25, took 1.020189s
u64 , no+, took 0.765725s
EDIT: After looking into the disassembly at -O2 and -O3, the major difference is in the code generation. (The reasoning above still holds at -O2, where different test machines/environments yield slightly different results.)
-O2:
400618: 48 69 d2 93 5b 00 00 imul $0x5b93,%rdx,%rdx
40061f: 48 83 e8 01 sub $0x1,%rax
400623: 75 f3 jne 400618 <_Z4testv+0x18>
-O3:
400628: 66 0f 6f d9 movdqa %xmm1,%xmm3
40062c: 83 c0 01 add $0x1,%eax
40062f: 66 0f 6f c1 movdqa %xmm1,%xmm0
400633: 66 0f f4 cc pmuludq %xmm4,%xmm1
400637: 3d 00 00 00 20 cmp $0x20000000,%eax
40063c: 66 0f f4 da pmuludq %xmm2,%xmm3
400640: 66 0f 73 d0 20 psrlq $0x20,%xmm0
....
-O3 generates vectorized code, but the loop has a heavy loop-carried dependency and cannot benefit from vectorization. The compiler actually emits much more complex code, which therefore takes much longer to run.