 

GCC optimization: how can fewer operations be slower?

While trying to benchmark some options for my code (using 128 bit integers or not) I observed a behavior I just can't understand. Could anybody shed some light on this?

#include <stdio.h>
#include <stdint.h>
#include <time.h>

int main(int a, char** b)
{
    printf("Running tests\n");

    clock_t start = clock();
    unsigned __int128 t = 13;
    for(unsigned long i = 0; i < (1UL<<30); i++)
        t += 23442*t + 25;
    if(t == 0) printf("0\n");
    printf("u128, +25, took %fs\n", double(clock() - start)/CLOCKS_PER_SEC);

    start = clock();
    t = 13;
    for(unsigned long i = 0; i < (1UL<<30); i++)
        t += 23442*t;
    if(t == 0) printf("0\n");
    printf("u128, no+, took %fs\n", double(clock() - start)/CLOCKS_PER_SEC);

    start = clock();
    unsigned long u = 13;
    for(unsigned long i = 0; i < (1UL<<30); i++)
        u += 23442*u + 25;
    if(u == 0) printf("0\n");
    printf("u64 , +25, took %fs\n", double(clock() - start)/CLOCKS_PER_SEC);

    start = clock();
    u = 13;
    for(unsigned long i = 0; i < (1UL<<30); i++)
        u += 23442*u;
    if(u == 0) printf("0\n");
    printf("u64 , no+, took %fs\n", double(clock() - start)/CLOCKS_PER_SEC);

    return 0;
}

(Note that the printf calls are there so that gcc does not optimize out the for loops.) On my system, this reliably produces the following output:

u128, +25, took 2.411922s
u128, no+, took 1.799805s
u64 , +25, took 1.797960s
u64 , no+, took 2.454104s

While the 128 bit integer behavior makes sense, I fail to see how the 64 bit loop with fewer operations can perform significantly (about 30%) slower.

Is this a known behavior? What would be the general rule for benefiting from this optimization when writing loops of this kind?
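
As a side note on the printf trick mentioned above: another common way to keep gcc from discarding the loops is to store each result into a volatile variable. A minimal sketch of that approach for the first loop only (the sink variable and the reduced structure are illustrative, not part of the original code):

#include <stdio.h>
#include <time.h>

/* Store each result into a volatile variable so gcc cannot discard the loop
   even when nothing is printed. The name "sink" is only for this sketch. */
static volatile unsigned long sink;

int main(void)
{
    clock_t start = clock();
    unsigned __int128 t = 13;
    for (unsigned long i = 0; i < (1UL << 30); i++)
        t += 23442*t + 25;
    sink = (unsigned long)t;  /* the low 64 bits depend on the whole chain */
    printf("u128, +25, took %fs\n", (double)(clock() - start)/CLOCKS_PER_SEC);
    return 0;
}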

Edit: the behavior is only observed when compiling with the -O3 option.

gcc -lstdc++ -O3 -o a main.cpp

u128, +25, took 2.413949s
u128, no+, took 1.799469s
u64 , +25, took 1.798278s
u64 , no+, took 2.453414s

gcc -lstdc++ -O2 -o a main.cpp

u128, +25, took 2.415244s
u128, no+, took 1.800499s
u64 , +25, took 1.798699s
u64 , no+, took 1.348133s
Dai asked May 06 '16

1 Answer

The loop is so tight that dependency stalls, ALU contention, and similar effects come into play and dominate the timing. The result is therefore not reliable and is more sensitive to those factors than to the actual number of instructions executed.

Note that the +25 can be computed in parallel with the multiply.
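
For reference, the constant folding gcc performs on the two loop bodies (0x5b93 in the -O2 disassembly below is 23443, i.e. 23442 + 1):

t += 23442*t + 25;   // folds to  t = 23443*t + 25  -> one multiply plus one add per iteration
t += 23442*t;        // folds to  t = 23443*t       -> one multiply per iteration

So the +25 variant only costs one extra addition per iteration.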


PS. My results on a 4970K:

gcc version 5.2.1 20151010
gcc -lstdc++ -O2 -o a a.cpp

u128, +25, took 1.346360s
u128, no+, took 1.022965s
u64 , +25, took 1.020189s
u64 , no+, took 0.765725s

EDIT: After looking into the disassembly at -O2 and -O3, the major difference turns out to be in the generated code. (The reasoning above still holds at -O2; different test machines/environments yield slightly different results.)

-O2:

400618:       48 69 d2 93 5b 00 00    imul   $0x5b93,%rdx,%rdx
40061f:       48 83 e8 01             sub    $0x1,%rax
400623:       75 f3                   jne    400618 <_Z4testv+0x18>

-O3:

400628:       66 0f 6f d9             movdqa %xmm1,%xmm3
40062c:       83 c0 01                add    $0x1,%eax
40062f:       66 0f 6f c1             movdqa %xmm1,%xmm0
400633:       66 0f f4 cc             pmuludq %xmm4,%xmm1
400637:       3d 00 00 00 20          cmp    $0x20000000,%eax
40063c:       66 0f f4 da             pmuludq %xmm2,%xmm3
400640:       66 0f 73 d0 20          psrlq  $0x20,%xmm0
....

-O3 generates vectorized code, but the loop carries a heavy dependency chain and cannot benefit from vectorization. The compiler actually emits much more complex code, and the timing is therefore much longer.
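
If so, one would expect that keeping -O3 but disabling the vectorizer, e.g.

gcc -lstdc++ -O3 -fno-tree-vectorize -o a main.cpp

should bring back the scalar loop and timings close to the -O2 run.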

Non-maskable Interrupt answered Sep 24 '22