While looking at some questions on optimization, this accepted answer for the question on coding practices for most effective use of the optimizer piqued my curiosity. The assertion is that local variables should be used for computations in a function, not output arguments. It was suggested this would allow the compiler to make additional optimizations otherwise not possible.
So I wrote a simple bit of code for the example Foo class and compiled the fragments with g++ v4.4 and -O2, generating assembler output with -S. The relevant parts of the assembler listings, showing just the loop portion, are below. On examination of the output, the loop is nearly identical in both cases, differing in just one address: a pointer to the output argument in the first example, or to the local variable in the second.
There seems to be no change in the actual effect whether the local variable is used or not. So the question breaks down into three parts:
a) is GCC not doing additional optimization, even given the hint suggested;
b) is GCC successfully optimizing in both cases, but should not be;
c) is GCC successfully optimizing in both cases, and is producing compliant output as defined by the C++ standard?
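The original post never shows the Foo class itself. As a rough stand-in (an assumption, inferred from the assembly below: _ZN3Foo5mungeES_S_ demangles to Foo::munge(Foo, Foo), and the 4-byte stride suggests a single 4-byte member), something like this would let the fragments compile:

// Hypothetical stand-in for the example Foo class (not shown in the
// original question). The mangled name implies munge takes both
// arguments by value.
class Foo
{
public:
    void munge(Foo a, Foo b);  // defined out of line, opaque to the optimizer
private:
    int value;                 // 4 bytes, matching the addl $4 stride below
};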
Here is the unoptimized function:
void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    for (int i = 0; i < numFoo; i++)
    {
        barOut.munge(foo1, foo2[i]);
    }
}
And corresponding assembly:
.L3:
movl (%esi), %eax
addl $1, %ebx
addl $4, %esi
movl %eax, 8(%esp)
movl (%edi), %eax
movl %eax, 4(%esp)
movl 20(%ebp), %eax ; Note address is that of the output argument
movl %eax, (%esp)
call _ZN3Foo5mungeES_S_
cmpl %ebx, 16(%ebp)
jg .L3
Here is the re-written function:
void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    Foo barTemp = barOut;
    for (int i = 0; i < numFoo; i++)
    {
        barTemp.munge(foo1, foo2[i]);
    }
    barOut = barTemp;
}
And here is the compiler output for the function using a local variable:
.L3:
movl (%esi), %eax    ; load foo2[i] into EAX
addl $1, %ebx        ; increment i
addl $4, %esi        ; advance the foo2 pointer to the next element (4-byte elements on this 32-bit build)
movl %eax, 8(%esp)   ; store foo2[i] in the outgoing argument area (careful! from EAX, not ESI)
movl (%edi), %eax    ; load foo1 into EAX
movl %eax, 4(%esp)   ; store foo1
leal -28(%ebp), %eax ; compute the address of barTemp
movl %eax, (%esp)    ; store it as the this pointer for munge
call _ZN3Foo5mungeES_S_ ; munge()!
cmpl %ebx, 16(%ebp) ; i < numFoo
jg .L3 ; recall incrementing i by one coming into the loop
; so test if greater
A note on optimization levels: if you do not specify an optimization option, GCC's goal is to reduce compilation time and to make debugging produce the results expected from reading the source code; that is -O0, the default. GCC has a range of optimization levels, plus individual options to enable or disable particular optimizations. The overall level is controlled by the command-line option -On, where n is the required optimization level: -O0 (the default), -O1, -O2, -O3, or -Os (which optimizes for binary size rather than execution speed).
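For reference, the listings above can be regenerated along these lines (the file name is hypothetical):

g++ -S -O2 example.cpp   # optimized assembly, as used in the question
g++ -S -O0 example.cpp   # default: no optimization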
The example given in that answer was not a very good one, because it involves a call to an unknown function that the compiler cannot reason much about. Here's a better example:
void FillOneA(int *array, int length, int& startIndex)
{
    for (int i = 0; i < length; i++) array[startIndex + i] = 1;
}

void FillOneB(int *array, int length, int& startIndex)
{
    int localIndex = startIndex;
    for (int i = 0; i < length; i++) array[localIndex + i] = 1;
}
The first version optimizes poorly because it needs to protect against the possibility that somebody called it as
int array[10] = { 0 };
FillOneA(array, 5, array[1]);
resulting in {1, 1, 0, 1, 1, 1, 0, 0, 0, 0 }
since the iteration with i=1 modifies the startIndex parameter.
The second one doesn't need to worry about the possibility that array[localIndex + i] = 1 will modify localIndex, because localIndex is a local variable whose address has never been taken.
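To make the aliasing concrete before looking at the generated code, here is a small hypothetical driver (assuming the two functions above are in scope):

#include <cstdio>

int main()
{
    int array[10] = { 0 };
    FillOneA(array, 5, array[1]);  // startIndex aliases array[1]
    for (int i = 0; i < 10; i++)
        std::printf("%d ", array[i]);
    std::printf("\n");             // prints: 1 1 0 1 1 1 0 0 0 0
    return 0;
}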
In assembly (Intel notation, because that's what I use):
FillOneA:
    mov edx, [esp+8]     ; length
    xor eax, eax         ; i = 0
    test edx, edx
    jle $b               ; skip the loop entirely if length <= 0
    push esi
    mov esi, [esp+16]    ; esi = &startIndex
    push edi
    mov edi, [esp+12]    ; edi = array
$a: mov ecx, [esi]       ; re-read startIndex on every iteration
    add ecx, eax         ; startIndex + i
    inc eax              ; i++
    mov [edi+ecx*4], 1   ; array[startIndex + i] = 1
    cmp eax, edx
    jl $a
    pop edi
    pop esi
$b: ret
FillOneB:
    mov ecx, [esp+8]     ; ecx = length (the rep count)
    mov eax, [esp+12]    ; eax = &startIndex
    mov edx, [eax]       ; read startIndex once, outside the loop
    test ecx, ecx
    jle $a
    mov eax, [esp+4]     ; eax = array
    push edi
    lea edi, [eax+edx*4] ; edi = &array[localIndex]
    mov eax, 1
    rep stosd            ; the whole loop collapses to a block fill of 1s
    pop edi
$a: ret
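As an aside (not part of the original answer): a similar effect can sometimes be achieved without the manual copy by promising the compiler there is no aliasing, for example with GCC's nonstandard __restrict__ extension. A minimal sketch:

// Sketch only: __restrict__ is a GCC extension, not standard C++.
// It asserts that startIndex does not alias any element of array,
// which lets the compiler read startIndex once, outside the loop.
void FillOneC(int* __restrict__ array, int length, int& __restrict__ startIndex)
{
    for (int i = 0; i < length; i++) array[startIndex + i] = 1;
}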
ADDED: Here's an example where the compiler's insight is into Bar, and not munge:
class Bar
{
public:
    float getValue() const
    {
        return valueBase * boost;
    }

private:
    float valueBase;
    float boost;
};

class Foo
{
public:
    void munge(float adjustment);
};

void Adjust10A(Foo& foo, const Bar& bar)
{
    for (int i = 0; i < 10; i++)
        foo.munge(bar.getValue());
}

void Adjust10B(Foo& foo, const Bar& bar)
{
    Bar localBar = bar;
    for (int i = 0; i < 10; i++)
        foo.munge(localBar.getValue());
}
The resulting code is
Adjust10A:
push ecx
push ebx
mov ebx, [esp+12] ;; foo
push esi
mov esi, [esp+20] ;; bar
push edi
mov edi, 10
$a: fld [esi+4] ;; bar.boost (valueBase is at offset 0, boost at offset 4)
push ecx
fmul [esi] ;; valueBase * boost
mov ecx, ebx
fstp [esp+16]
fld [esp+16]
fstp [esp]
call Foo::munge
dec edi
jne $a
pop edi
pop esi
pop ebx
pop ecx
ret 0
Adjust10B:
sub esp, 8
mov ecx, [esp+16] ;; bar
mov eax, [ecx] ;; bar.valueBase
mov [esp], eax ;; localBar.valueBase
fld [esp] ;; localBar.valueBase
mov eax, [ecx+4] ;; bar.boost
mov [esp+4], eax ;; localBar.boost
fmul [esp+4] ;; localBar.getValue()
push esi
push edi
mov edi, [esp+20] ;; foo
fstp [esp+24]
fld [esp+24] ;; cache localBar.getValue()
mov esi, 10 ;; loop counter
$a: push ecx
mov ecx, edi ;; foo
fstp [esp] ;; use cached value
call Foo::munge
fld [esp]
dec esi
jne $a ;; loop
pop edi
fstp ST(0)
pop esi
add esp, 8
ret 0
Observe that the inner loop in Adjust10A must recalculate the value, since it must protect against the possibility that foo.munge changed bar.
That said, this style of optimization is not a slam dunk. (For example, we could've gotten the same effect by manually caching bar.getValue() into localValue before the loop.) It tends to be most helpful for vectorized operations, since those can be parallelized.