Difficulties to measure C/C++ performance

Question

I wrote a piece of C code to show a point in a discussion about optimizations and branch prediction. Then I noticed even more diverse outcome than I did expect. My goal was to write it in a language that is common subset between C++ and C, that is standard-compliant for both languages and that is fairly portable. It was tested on different Windows PCs:

#include <stdio.h>
#include <time.h>

/// @return - time difference between start and stop in milliseconds
int ms_elapsed( clock_t start, clock_t stop )
{
    return (int)( 1000.0 * ( stop - start ) / CLOCKS_PER_SEC );
}

int const Billion = 1000000000;
/// & with numbers up to Billion gives 0, 0, 2, 2 repeating pattern 
int const Pattern_0_0_2_2 = 0x40000002; 

/// @return - half of Billion  
int unpredictableIfs()
{
    int sum = 0;
    for ( int i = 0; i < Billion; ++i )
    {
        // true, true, false, false ...
        if ( ( i & Pattern_0_0_2_2 ) == 0 )
        {
            ++sum;
        }
    }
    return sum;
}

/// @return - half of Billion  
int noIfs()
{
    int sum = 0;
    for ( int i = 0; i < Billion; ++i )
    {
        // 1, 1, 0, 0 ...
        sum += ( i & Pattern_0_0_2_2 ) == 0;
    }
    return sum;
}

int main()
{
    clock_t volatile start;
    clock_t volatile stop;
    int volatile sum;
    printf( "Puzzling measurements:
" );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d
"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d
"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = noIfs();
    stop = clock();
    printf( "Same without ifs took %d msec; answer was %d
"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d
"
          , ms_elapsed(start, stop), sum );
}

Compiled with VS2010; /O2 optimizations Intel Core 2, WinXP results:

Puzzling measurements:
Unpredictable ifs took 1344 msec; answer was 500000000
Unpredictable ifs took 1016 msec; answer was 500000000
Same without ifs took 1031 msec; answer was 500000000
Unpredictable ifs took 4797 msec; answer was 500000000

Edit: Full switches of compiler:

/Zi /nologo /W3 /WX- /O2 /Oi /Oy- /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS /Gy /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\Trying.pch" /Fa"Release\" /Fo"Release\" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue

Other person posted such ... Compiled with MinGW, g++ 4.71, -O1 optimizations Intel Core 2, WinXP results:

Puzzling measurements:
Unpredictable ifs took 1656 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000
Same without ifs took 1969 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000

Also he posted such results for -O3 optimizations:

Puzzling measurements:
Unpredictable ifs took 1890 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000
Same without ifs took 1422 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000

Now I have question. What is going on here?

More specifically ... How can a fixed function take so different amounts of time? Is there something wrong in my code? Is there something tricky with Intel processor? Are the compilers doing something odd? Can it be because of 32 bit code ran on 64 bit processor?

Thanks for attention!

Edit: I accept that g++ -O1 just reuses returned values in 2 other calls. I also accept that g++ -O2 and g++ -O3 have defect that leaves the optimization out. Significant diversity of measured speeds (450% !!!) seems still mysterious.

I looked at disassembly of code produced by VS2010. It did inline unpredictableIfs 3 times. The inlined code was fairly similar; the loop was same. It did not inline noIfs. It did roll noIfs out a bit. It takes 4 steps in one iteration. noIfs calculate like was written while unpredictableIfs use jne to jump over increment.

Daniel Fischer · Accepted Answer

With -O1, gcc-4.7.1 calls unpredictableIfs only once and resuses the result, since it recognizes that it's a pure function, so the result will be the same every time it's called. (Mine did, verified by looking at the generated assembly.)

With higher optimisation level, the functions are inlined, and the compiler doesn't recognize that it's the same code anymore, so it is run each time a function call appears in the source.

Apart from that, my gcc-4.7.1 deals best with unpredictableIfs when using -O1 or -O2 (apart from the reuse issue, both produce the same code), while noIfs is treated much better with -O3. The timings between the different runs of the same code are however consistent here - equal or differing by 10 milliseconds (granularity of clock), so I have no idea what could cause the substantially different times for unpredictableIfs you reported for -O3.

With -O2, the loop for unpredictableIfs is identical to the code generated with -O1 (except for register swapping):

.L12:
    movl    %eax, %ecx
    andl    $1073741826, %ecx
    cmpl    $1, %ecx
    adcl    $0, %edx
    addl    $1, %eax
    cmpl    $1000000000, %eax
    jne .L12

and for noIfs it's similar:

.L15:
    xorl    %ecx, %ecx
    testl   $1073741826, %eax
    sete    %cl
    addl    $1, %eax
    addl    %ecx, %edx
    cmpl    $1000000000, %eax
    jne .L15

where it was

.L7:
    testl   $1073741826, %edx
    sete    %cl
    movzbl  %cl, %ecx
    addl    %ecx, %eax
    addl    $1, %edx
    cmpl    $1000000000, %edx
    jne .L7

with -O1. Both loops run in similar time, with unpredictableIfs a bit faster.

With -O3, the loop for unpredictableIfs becomes worse,

.L14:
    leal    1(%rdx), %ecx
    testl   $1073741826, %eax
    cmove   %ecx, %edx
    addl    $1, %eax
    cmpl    $1000000000, %eax
    jne     .L14

and for noIfs (including the setup-code here), it becomes better:

    pxor    %xmm2, %xmm2
    movq    %rax, 32(%rsp)
    movdqa  .LC3(%rip), %xmm6
    xorl    %eax, %eax
    movdqa  .LC2(%rip), %xmm1
    movdqa  %xmm2, %xmm3
    movdqa  .LC4(%rip), %xmm5
    movdqa  .LC5(%rip), %xmm4
    .p2align 4,,10
    .p2align 3
.L18:
    movdqa  %xmm1, %xmm0
    addl    $1, %eax
    paddd   %xmm6, %xmm1
    cmpl    $250000000, %eax
    pand    %xmm5, %xmm0
    pcmpeqd %xmm3, %xmm0
    pand    %xmm4, %xmm0
    paddd   %xmm0, %xmm2
    jne .L18

.LC2:
    .long   0
    .long   1
    .long   2
    .long   3
    .align 16
.LC3:
    .long   4
    .long   4
    .long   4
    .long   4
    .align 16
.LC4:
    .long   1073741826
    .long   1073741826
    .long   1073741826
    .long   1073741826
    .align 16
.LC5:
    .long   1
    .long   1
    .long   1
    .long   1

it computes four iterations at once, and accordingly, noIfs runs almost four times as fast then.

Mats Petersson · Answer

Right, looking at the assembler code from gcc on 64-bit Linux, the first case, with -O1, the function UnpredictableIfs is indeed called only once, and the result reused.

With -O2 and -O3, the functions are inlined, and the time it takes should be identical. There is also no actual branches in either bit of code, but the translation for the two bits of code is somewhat different, I've cut out the lines that update "sum" [in %edx in both cases]

UnpredictableIfs:

movl    %eax, %ecx
andl    $1073741826, %ecx
cmpl    $1, %ecx
adcl    $0, %edx
addl    $1, %eax

NoIfs:

xorl    %ecx, %ecx
testl   $1073741826, %eax
sete    %cl
addl    $1, %eax
addl    %ecx, %edx

As you can see, it's not quite identical, but it does very similar things.

Difficulties to measure C/C++ performance

Tags:

c++

performance

c

optimization

measurement

Öö Tiib

2 Answers

Daniel Fischer

Mats Petersson

Recent Activity

Donate For Us

Difficulties to measure C/C++ performance

Tags:

c++

performance

c

optimization

measurement

Öö Tiib

2 Answers

Daniel Fischer

Mats Petersson

Related questions

Recent Activity

Donate For Us