Is (really) '

Question

This is the typical question to be DV, so I've hesitated (a lot) before posting it...

I know this question was marked as duplicate, but my tests (if they are good: are they good? this is part of the question) tends to show this is not the case.

At the beginning, I did some tests comparing a for loop to a while loop.

That shows that for loop was better.

But going further, for or while was not the point: the difference is related to:

for (int l = 0; l < loops;l++) {

or

for (int l = 0; l != loops;l++) {

And if you run that (under Windows 10, Visual studio 2017, release), you see that the first one is more than twice faster than the second one.

It is hard (for a novice as I am) to understand if the compiler for some reasons is able to optimize more one or another. But...

Short question

Why?

Longer question

The complete code is the following:

For the '<' loop:

int forloop_inf(int loops, int iterations)
{
    int n = 0;
    int x = n;

    for (int l = 0; l < loops;l++) {
        for (int i = 0; i < iterations;i++) {
            n++;
            x += n;
        }
    }

    return x;
}

For the '!=' loop:

int forloop_diff(int loops, int iterations)
{
    int n = 0;
    int x = n;

    for (int l = 0; l != loops;l++) {
        for (int i = 0; i != iterations;i++) {
            n++;
            x += n;
        }
    }

    return x;
}

In both cases, the inner calculation is just here in order to avoid the compiler skipping all the loops.

Respectively this is called by:

printf("for loop inf %f
", monitor_int(loops, iterations, forloop_inf, &result));
printf("%d
", result);

and

printf("for loop diff %f
", monitor_int(loops, iterations, forloop_diff, &result));
printf("%d
", result);

where loops = 10 * 1000 and iterations = 1000 * 1000.

And where monitor_int is:

double monitor_int(int loops, int iterations, int(*func)(int, int), int *result)
{
    clock_t start = clock();

    *result = func(loops, iterations);

    clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC;
}

The result in seconds is:

for loop inf 2.227 seconds
for loop diff 4.558 seconds

So, even if the interest of all that is relative to the weight of what is done inside the loop compared to the loop itself, why such a difference?

Edit:

You can find here the full source code reviewed so that functions are called in a random order several times.

The corresponding disassembly is here (obtained with dumpbin /DISASM CPerf2.exe).

Running it, I now obtain:

'!=' 0.045231 (average on 493 runs)
'<' 0.031010 (average on 507 runs)

I do not know how to set O3 in Visual Studio, the compile command line is the following:

/permissive- /Yu"stdafx.h" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64\Release\vc141.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oi /MD /FC /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Ot /Fp"x64\Release\CPerf2.pch" /diagnostics:classic

The code for the loops is above, here is the random way to run it:

typedef int(loop_signature)(int, int);

void loops_compare()
{
    int loops = 1 * 100;
    int iterations = 1000 * 1000;
    int result;

    loop_signature *functions[2] = {
        forloop_diff,
        forloop_inf
    };

    int n_rand = 1000;

    int n[2] = { 0, 0 };
    double cum[2] = { 0.0, 0.0 };

    for (int i = 0; i < n_rand;i++) {
        int pick = rand() % 2;
        loop_signature *fun = functions[pick];

        double time = monitor(loops, iterations, fun, &result);
        n[pick]++;
        cum[pick] += time;
    }

    printf("'!=' %f (%d) / '<' %f (%d)
", cum[0] / (double)n[0], n[0], cum[1] / (double)n[1], n[1]);
}

and the disassembly (the loop functions only, but not sure it is the good extract of the link above):

?forloop_inf@@YAHHH@Z:
  0000000140001000: 48 83 EC 08        sub         rsp,8
  0000000140001004: 45 33 C0           xor         r8d,r8d
  0000000140001007: 45 33 D2           xor         r10d,r10d
  000000014000100A: 44 8B DA           mov         r11d,edx
  000000014000100D: 85 C9              test        ecx,ecx
  000000014000100F: 7E 6F              jle         0000000140001080
  0000000140001011: 48 89 1C 24        mov         qword ptr [rsp],rbx
  0000000140001015: 8B D9              mov         ebx,ecx
  0000000140001017: 66 0F 1F 84 00 00  nop         word ptr [rax+rax]
                    00 00 00
  0000000140001020: 45 33 C9           xor         r9d,r9d
  0000000140001023: 33 D2              xor         edx,edx
  0000000140001025: 33 C0              xor         eax,eax
  0000000140001027: 41 83 FB 02        cmp         r11d,2
  000000014000102B: 7C 29              jl          0000000140001056
  000000014000102D: 41 8D 43 FE        lea         eax,[r11-2]
  0000000140001031: D1 E8              shr         eax,1
  0000000140001033: FF C0              inc         eax
  0000000140001035: 8B C8              mov         ecx,eax
  0000000140001037: 03 C0              add         eax,eax
  0000000140001039: 0F 1F 80 00 00 00  nop         dword ptr [rax]
                    00
  0000000140001040: 41 FF C1           inc         r9d
  0000000140001043: 83 C2 02           add         edx,2
  0000000140001046: 45 03 C8           add         r9d,r8d
  0000000140001049: 41 03 D0           add         edx,r8d
  000000014000104C: 41 83 C0 02        add         r8d,2
  0000000140001050: 48 83 E9 01        sub         rcx,1
  0000000140001054: 75 EA              jne         0000000140001040
  0000000140001056: 41 3B C3           cmp         eax,r11d
  0000000140001059: 7D 06              jge         0000000140001061
  000000014000105B: 41 FF C2           inc         r10d
  000000014000105E: 45 03 D0           add         r10d,r8d
  0000000140001061: 42 8D 0C 0A        lea         ecx,[rdx+r9]
  0000000140001065: 44 03 D1           add         r10d,ecx
  0000000140001068: 41 8D 48 01        lea         ecx,[r8+1]
  000000014000106C: 41 3B C3           cmp         eax,r11d
  000000014000106F: 41 0F 4D C8        cmovge      ecx,r8d
  0000000140001073: 44 8B C1           mov         r8d,ecx
  0000000140001076: 48 83 EB 01        sub         rbx,1
  000000014000107A: 75 A4              jne         0000000140001020
  000000014000107C: 48 8B 1C 24        mov         rbx,qword ptr [rsp]
  0000000140001080: 41 8B C2           mov         eax,r10d
  0000000140001083: 48 83 C4 08        add         rsp,8
  0000000140001087: C3                 ret
  0000000140001088: CC CC CC CC CC CC CC CC                          ÌÌÌÌÌÌÌÌ
?forloop_diff@@YAHHH@Z:
  0000000140001090: 45 33 C0           xor         r8d,r8d
  0000000140001093: 41 8B C0           mov         eax,r8d
  0000000140001096: 85 C9              test        ecx,ecx
  0000000140001098: 74 28              je          00000001400010C2
  000000014000109A: 44 8B C9           mov         r9d,ecx
  000000014000109D: 0F 1F 00           nop         dword ptr [rax]
  00000001400010A0: 85 D2              test        edx,edx
  00000001400010A2: 74 18              je          00000001400010BC
  00000001400010A4: 8B CA              mov         ecx,edx
  00000001400010A6: 66 66 0F 1F 84 00  nop         word ptr [rax+rax]
                    00 00 00 00
  00000001400010B0: 41 FF C0           inc         r8d
  00000001400010B3: 41 03 C0           add         eax,r8d
  00000001400010B6: 48 83 E9 01        sub         rcx,1
  00000001400010BA: 75 F4              jne         00000001400010B0
  00000001400010BC: 49 83 E9 01        sub         r9,1
  00000001400010C0: 75 DE              jne         00000001400010A0
  00000001400010C2: C3                 ret
00000001400010C3: CC CC CC CC CC CC CC CC CC CC CC CC CC ÌÌÌÌÌÌÌÌÌÌÌÌÌ

Edit again:

What I feel surprising is also the following:

In debug more the performance is the same (and the assembly code too)
So how to be confident about what you're coding if such differences appear after that? (considering I've done no mistake somewhere)

Schwern · Accepted Answer

For proper benchmarking, it's important to run the functions in random order and many times.

typedef int(signature)(int, int);

...

int main() {
    int loops, iterations, runs;

    fprintf(stderr, "Loops: ");
    scanf("%d", &loops);
    fprintf(stderr, "Iterations: ");
    scanf("%d", &iterations);
    fprintf(stderr, "Runs: ");
    scanf("%d", &runs);

    fprintf(stderr, "Running for %d loops and %d iterations %d times.
", loops, iterations, runs);

    signature *functions[2] = {
        forloop_inf,
        forloop_diff
    };

    int result = functions[0](loops, iterations);
    for( int i = 0; i < runs; i++ ) {
        int pick = rand() % 2;
        signature *function = functions[pick];

        int new_result;
        printf("%d %f
", pick, monitor_int(loops, iterations, function, &new_result));
        if( result != new_result ) {
            fprintf(stderr, "got %d expected %d
", new_result, result);
        }
    }
}

Armed with this we can do 1000 runs in random order and find the average times.

It's also important to benchmark with optimizations turned on. Not much point in asking how fast unoptimized code will run. I'll try at -O2 and -O3.

My findings are that with Apple LLVM version 8.0.0 (clang-800.0.42.1) doing 10000 loops and 1000000 iterations at -O2 forloop_inf is indeed about 50% faster than forloop_diff.

forloop_inf: 0.000009
forloop_diff: 0.000014

Looking at the generated assembly code for -O2 with clang -O2 -S -mllvm --x86-asm-syntax=intel test.c I can see many differences between the two implementations. Maybe somebody who knows assembly can tell us why.

But at -O3 the performance difference is no longer discernible.

forloop_inf: 0.000002
forloop_diff: 0.000002

This is because at -O3 they are almost exactly the same. One is using je and one is using jle. That's it.

In conclusion, when benchmarking...

Do many runs.
Randomize the order.
Compile and run as close as you can do how you would in production.
- In this case that means turning on compiler optimizations.
Look at the assembly code.

And most of all.

Pick the safest code, not the fastest.

i < max is safer than i != max because it will still terminate if i somehow jumps over max.

As demonstrated, with optimizations turned on, they're both so fast that even not fully optimized they can whip through 10,000,000,000 iterations in 0.000009 seconds. i < max or i != max is unlikely to be a performance bottleneck, rather whatever you're doing 10 billion times is.

But i != max could lead to a bug.

Is (really) '<' faster than '!=' in C?

Tags:

c

lemon

1 Answers

Schwern

Recent Activity

Donate For Us