Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does dereferencing makes my program faster?

Tags:

c

Considering the following test programs :

Loop value on the stack

int main( void ) {
    int iterations = 1000000000;

    while ( iterations > 0 )
        -- iterations;
}

Loop value on the stack (dereferenced)

int main( void ) {
    int iterations = 1000000000;
    int * p = & iterations;

    while ( * p > 0 )
        -- * p;
}

Loop value on the heap

#include <stdlib.h>

int main( void ) {
    int * p = malloc( sizeof( int ) );
    * p = 1000000000;

    while ( *p > 0 )
        -- * p;
}

By compiling them with -O0, I get the following execution times :

case1.c
real    0m2.698s
user    0m2.690s
sys     0m0.003s

case2.c
real    0m2.574s
user    0m2.567s
sys     0m0.000s

case3.c
real    0m2.566s
user    0m2.560s
sys     0m0.000s

[edit] Following is the average on 10 executions :

case1.c
2.70364

case2.c
2.57091

case3.c
2.57000

Why is the execution time bigger with the first test case, which seems to be the simplest ?

My current architecture is a x86 virtual machine (Archlinux). I get these results both with gcc (4.8.0) and clang (3.3).

[edit 1] Generated assembler codes are almost identical except that the second and third ones have more instructions than the first one.

[edit 2] These performances are reproducible (on my system). Each execution will have the same order of magnitude.

[edit 3] I don't really care about performances of a non-optimized program, but I don't understand why it would be slower, and I'm curious.

like image 414
Maël Nison Avatar asked Jul 16 '13 22:07

Maël Nison


1 Answers

It's hard to say if this is the reason since I'm doing some guessing and you haven't given some specifics (like which target you're using). But what I see when I compile without optimziations with an x86 target is the following sequences for decrementign the iterations variable:

Case 1:

L3:
    sub DWORD PTR [esp+12], 1
L2:
    cmp DWORD PTR [esp+12], 0
    jg  L3

Case 2:

L3:
    mov eax, DWORD PTR [esp+12]
    mov eax, DWORD PTR [eax]
    lea edx, [eax-1]
    mov eax, DWORD PTR [esp+12]
    mov DWORD PTR [eax], edx
L2:
    mov eax, DWORD PTR [esp+12]
    mov eax, DWORD PTR [eax]
    test    eax, eax
    jg  L3

One big difference that you see in case 1 is that the instruction at L3 reads and writes the memory location. It is followed immediately byu an instruction that reads the same memory location that was just written. This sort of sequence of instructions (the same memory location written then immediate used in the next instruction) often causes some sort of pipeline stall in modern CPUs.

You'll note that the write followed immediately by a read of the same location is not present in case 2.

Again - this answer is a bit of informed speculation.

like image 168
Michael Burr Avatar answered Oct 13 '22 19:10

Michael Burr