Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speed of accessing local vs. global variables in gcc/g++ at different optimization levels

I found that different compiler optimization levels in gcc give quite different results when accessing a local or a global variable in a loop. The reason this surprised me is that if access to one type of variable is more optimizable than access to another, I would think gcc optimization would exploit that fact. Here come two examples (in C++ but their C counterparts give practically the same timings):

    global = 0;
    for (int i = 0; i < SIZE; i++)
        global++;

which uses a global variable long global, versus

    long tmp = 0;
    for (int i = 0; i < SIZE; i++)
        tmp++;
    global = tmp;

At optimization level -O0 the timing is essentially equal (as I would expect), at -O1 it is somewhat faster but still equal, but from -O2 the version using the global variable is much faster (a factor 7 or so).

On the other hand, in the following code fragments where start points to a block of bytes of size SIZE:

    global = 0;
    for (const char* p = start; p < start + SIZE; p++)
        global += *p;

versus

    long tmp = 0;
    for (const char* p = start; p < start + SIZE; p++)
        tmp += *p;
    global = tmp;

Here at -O0 the timings are close, though the version using the local variable is slightly faster, which doesn't seem too surprising, as maybe it will be stored in a register, whereas global wouldn't. Then at -O1 and higher the version using a local variable is considerably faster (more than 50% or 1.5 times). As remarked before, this surprises me, because I would think that for gcc it would be as easy as for me to use a local variable (in the generated optimized code) to assign to the global one later on.

So my question is: what is it about global and local variables that makes that gcc can only perform certain optimizations to one type, not the other?

Some details that may or may not be relevant: I used gcc/g++ version 3.4.5 on a machine running RHEL4 with two single core processors and 4GB RAM. The value I used for SIZE, which is a preprocessor macro, was 1000000000. The block of bytes in the second example was dynamically allocated.

Here are some timing outputs for optimization levels 0 to 4 (in the same order as above):

$ ./st0
Result using global variable: 1000000000 in 2.213 seconds.
Result using local variable:  1000000000 in 2.210 seconds.
Result using global variable: 0 in 3.924 seconds.
Result using local variable:  0 in 3.710 seconds.
$ ./st1
Result using global variable: 1000000000 in 0.947 seconds.
Result using local variable:  1000000000 in 0.947 seconds.
Result using global variable: 0 in 2.135 seconds.
Result using local variable:  0 in 1.212 seconds.
$ ./st2
Result using global variable: 1000000000 in 0.022 seconds.
Result using local variable:  1000000000 in 0.552 seconds.
Result using global variable: 0 in 2.135 seconds.
Result using local variable:  0 in 1.227 seconds.
$ ./st3
Result using global variable: 1000000000 in 0.065 seconds.
Result using local variable:  1000000000 in 0.461 seconds.
Result using global variable: 0 in 2.453 seconds.
Result using local variable:  0 in 1.646 seconds.
$ ./st4
Result using global variable: 1000000000 in 0.063 seconds.
Result using local variable:  1000000000 in 0.468 seconds.
Result using global variable: 0 in 2.467 seconds.
Result using local variable:  0 in 1.663 seconds.

EDIT This is the generated assembly for the first two snippets with switch -O2, the case where the difference is largest. For as far as I understand, it looks like a bug in the compiler: 0x3b9aca00 is SIZE in hexadecimal, 0x80496dc must be the address of global. I checked with a newer compiler, and this doesn't happen anymore. The difference in the second pair of snippets is similar however.

    void global1()
    {
        int i;
        global = 0;
        for (i = 0; i < SIZE; i++)
            global++;
    }

    void local1()
    {
        int i;
        long tmp = 0;
        for (i = 0; i < SIZE; i++)
            tmp++;
        global = tmp;
    }

    080483d0 <global1>:
     80483d0:   55                      push   %ebp
     80483d1:   89 e5                   mov    %esp,%ebp
     80483d3:   c7 05 dc 96 04 08 00    movl   $0x0,0x80496dc
     80483da:   00 00 00 
     80483dd:   b8 ff c9 9a 3b          mov    $0x3b9ac9ff,%eax
     80483e2:   89 f6                   mov    %esi,%esi
     80483e4:   83 e8 19                sub    $0x19,%eax
     80483e7:   79 fb                   jns    80483e4 <global1+0x14>
     80483e9:   c7 05 dc 96 04 08 00    movl   $0x3b9aca00,0x80496dc
     80483f0:   ca 9a 3b 
     80483f3:   c9                      leave  
     80483f4:   c3                      ret    
     80483f5:   8d 76 00                lea    0x0(%esi),%esi

    080483f8 <local1>:
     80483f8:   55                      push   %ebp
     80483f9:   89 e5                   mov    %esp,%ebp
     80483fb:   b8 ff c9 9a 3b          mov    $0x3b9ac9ff,%eax
     8048400:   48                      dec    %eax
     8048401:   79 fd                   jns    8048400 <local1+0x8>
     8048403:   c7 05 dc 96 04 08 00    movl   $0x3b9aca00,0x80496dc
     804840a:   ca 9a 3b 
     804840d:   c9                      leave  
     804840e:   c3                      ret    
     804840f:   90                      nop    

Finally here is the code of the remaining snippets, now generated by gcc 4.3.3 using -O3 (though the old version seems to generate similar code). It looks like indeed global2(..) compiles to a function accessing the global memory location in every iteration of the loop, where local2(..) uses a register. It is still not clear to me why gcc wouldn't optimize the global version using a register anyway. Is this just a lacking feature, or would it really lead to unacceptable behaviour of the executable?

    void global2(const char* start)
    {
        const char* p;
        global = 0;
        for (p = start; p < start + SIZE; p++)
            global += *p;
    }

    void local2(const char* start)
    {
        const char* p;
        long tmp = 0;
        for (p = start; p < start + SIZE; p++)
            tmp += *p;
        global = tmp;
    }

    08048470 <global2>:
     8048470:   55                      push   %ebp
     8048471:   31 d2                   xor    %edx,%edx
     8048473:   89 e5                   mov    %esp,%ebp
     8048475:   8b 4d 08                mov    0x8(%ebp),%ecx
     8048478:   c7 05 24 a0 04 08 00    movl   $0x0,0x804a024
     804847f:   00 00 00 
     8048482:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
     8048488:   0f be 04 11             movsbl (%ecx,%edx,1),%eax
     804848c:   83 c2 01                add    $0x1,%edx
     804848f:   01 05 24 a0 04 08       add    %eax,0x804a024
     8048495:   81 fa 00 ca 9a 3b       cmp    $0x3b9aca00,%edx
     804849b:   75 eb                   jne    8048488 <global2+0x18>
     804849d:   5d                      pop    %ebp
     804849e:   c3                      ret    
     804849f:   90                      nop    

    080484a0 <local2>:
     80484a0:   55                      push   %ebp
     80484a1:   31 c9                   xor    %ecx,%ecx
     80484a3:   89 e5                   mov    %esp,%ebp
     80484a5:   31 d2                   xor    %edx,%edx
     80484a7:   53                      push   %ebx
     80484a8:   8b 5d 08                mov    0x8(%ebp),%ebx
     80484ab:   90                      nop    
     80484ac:   8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
     80484b0:   0f be 04 13             movsbl (%ebx,%edx,1),%eax
     80484b4:   83 c2 01                add    $0x1,%edx
     80484b7:   01 c1                   add    %eax,%ecx
     80484b9:   81 fa 00 ca 9a 3b       cmp    $0x3b9aca00,%edx
     80484bf:   75 ef                   jne    80484b0 <local2+0x10>
     80484c1:   5b                      pop    %ebx
     80484c2:   89 0d 24 a0 04 08       mov    %ecx,0x804a024
     80484c8:   5d                      pop    %ebp
     80484c9:   c3                      ret    
     80484ca:   8d b6 00 00 00 00       lea    0x0(%esi),%esi

Thanks.

like image 286
doetoe Avatar asked Aug 30 '11 09:08

doetoe


People also ask

Are local variables faster than global variables?

Conclusion: In most cases, local variables will be faster than global variables.

Which has higher priority global or local variable?

Within the body of a function, a local variable takes precedence over a global variable with the same name.

Do global variables make programs run faster?

Short answer - No, good programmers make code go faster by knowing and using the appropriate tools for the job, and then optimizing in a methodical way where their code does not meet their requirements.

Are global variables slow in C?

Global variables are really slow, in addition to all the other reasons not to use them.


2 Answers

A local variable tmp whose address is not taken cannot be pointed to by the pointer p, and the compiler can optimize accordingly. It is much more difficult to infer that a global variable global is not pointed to, unless it's static, because the address of that global variable could be taken in another compilation unit and passed around.

If reading the assembly indicates that the compiler forces itself to load from memory more often than you would expect, and you know that the aliasing it worries about cannot exist in practice, you can help it by copying the global variable into a local variable at the top of the function and using only the local in the rest of the function.

Finally, note that if pointer p had been of another type, the compiler could have invoked "strict aliasing rules" to optimize regardless of its inability to infer that p does not point to global. But because lvalues of type char are often used to observe the representation of other types, there is an allowance for this kind of alias, and the compiler cannot take this shortcut in your example.

like image 123
Pascal Cuoq Avatar answered Sep 19 '22 21:09

Pascal Cuoq


Global variable = global memory, and subject to aliasing (read as: bad for the optimizer -- must read-modify-write in the worst case).

Local variable = register (unless the compiler really can't help it, sometimes it must put it on the stack too, but the stack is practically guaranteed to be in L1)

Accessing a register is on the order of a single cycle, accessing memory is on the order of 15-1000 cycles (depending on whether the cache line is in cache and not invalidated by another core, and depending on whether the page is in the TLB).

like image 34
Damon Avatar answered Sep 20 '22 21:09

Damon