I am learning x86 assembler in order to write a compiler. In particular, I'm taking a variety of simple recursive functions and feeding them through different compilers (OCaml, GCC etc.) in order to get a better understanding of the kinds of assembler generated by different compilers.
I've got a trivial recursive integer Fibonacci function:
int fib(int x) { return (x < 2 ? x : fib(x-1)+fib(x-2)); }
My hand-coded assembly looks like this:
fib:
cmp eax, 2
jl fin
push eax
dec eax
call fib
push eax
mov eax, [esp+4]
add eax, -2
call fib
add eax, [esp]
add esp, 8
fin:
ret
Compiling that function to Intel-syntax assembler using gcc -O2
produces this enigmatic code:
_fib:
push edi
push esi
push ebx
sub esp, 16
mov edi, DWORD PTR [esp+32]
cmp edi, 1
jle L4
mov ebx, edi
xor esi, esi
L3:
lea eax, [ebx-1]
mov DWORD PTR [esp], eax
call _fib
sub ebx, 2
add esi, eax
cmp ebx, 1
jg L3
and edi, 1
L2:
lea eax, [esi+edi]
add esp, 16
pop ebx
pop esi
pop edi
ret
L4:
xor esi, esi
jmp L2
So I guess the calling convention is argument at [esp+4]
and return value in eax
. It starts by pushing edi
, esi
and ebx
. Then it claims another 16 bytes for a stack frame, enough for 4 temporary ints. Then edi
is read from [esp+32]
, which is the argument. If the argument is <=1
then it jumps to L4
which zeroes out (?) esi
before jumping back to L2
which sets eax=esi+edi
which is just the argument edi
. If the argument was >1
then the argument is copied into ebx
and zeroes esi
before falling through into L3
. In L3
, it sets eax=ebx-1
and stores the result (n-1) at esp
in the stack frame before recursing to calculate fib(n-1)
. The result is added to esi
, ebx
is set to n-2
and it loops back to L3
if ebx>1
otherwise it extracts the lower bit of edi
before falling through to L2
.
Why is this code so convoluted (e.g. is there a name for an optimization that has been done that I'm not seeing?)?
The recursive calls fib(n-2)
seem to have been replaced with a loop accumulating in esi
but that call wasn't in tail position so how was this done?
What is the purpose of the and edi, 1
?
What is the purpose of the mov DWORD PTR [esp], eax
?
Why is the stack frame so large?
Can you disassemble this algorithm back into C to make it clearer what is going on?
My preliminary impression is that GCC generates pretty poor x86 assembler. In this case, over 2× more code for equal performance (3.25s for fib(40) on this 1.6GHz Atom for both programs). Is that fair?
In addition to the comments above, note that the recursion has been unwound into a tail call by replacing:
return x < 2 ? x : fib(x - 2) + fib(x - 1);
with:
if ((xprime = x) < 2) {
acc = 0;
} else {
/* at this point we know x >= 2 */
acc = 0; /* start with 0 */
while (x > 1) {
acc += fib(x - 1); /* add fib(x-1) */
x -= 2; /* now we'll add fib(x-2) */
}
/* so at this point we know either x==1 or x==0 */
xprime = x == 1 ? 1 : 0; /* ie, x & 1 */
}
return xprime + acc;
I suspect this rather tricky loop arose from multiple optimization steps, not that I have fiddled with gcc optimization since about gcc 2.3 (it's all very different inside now!).
Quite simply, fib(x-2)
is equal to fib(x-3) + fib(x-4)
, fib(x-4)
is equal to fib(x-5) + fib(x-6)
etc. so fib(x)
is being calculated as fib(x-1) + fib(x-3) + fib(x-5) + ... + fib(x&1)
(fib(x&1)
equals x&1
) i.e. gcc has inlined the call to fib(x-2)
, which is quite a clever thing to do to a recursive function.
This first part is ensuring registers that should be preserved according to the calling convention are not trashed. I would guess the calling convention used here is cdecl
.
_fib:
push edi
push esi
push ebx
sub esp, 16
DWORD PTR[esp+32]
is your x
:
mov edi, DWORD PTR [esp+32]
cmp edi, 1
jle L4
If x
is less than or equal to 1 (this corresponds to your x < 2
), then go to L4
which is:
L4:
xor esi, esi
jmp L2
This zeroes out esi
and branches to L2
:
L2:
lea eax, [esi+edi]
add esp, 16
pop ebx
pop esi
pop edi
ret
This sets eax
(the return value) with esi+edi
. Since esi
is 0 already, edi
is just loaded in the case of 0 and 1. This corresponds to x < 2 ? x
.
Now we look at the case when x
is not < 2
:
mov ebx, edi
xor esi, esi
L3:
lea eax, [ebx-1]
mov DWORD PTR [esp], eax
call _fib
First, x
is copied to ebx
, then esi
is zeroed.
Next, eax
is set to x - 1
. This value is moved to the top of the stack and _fib
called. This corresponds to fib(x-1)
.
sub ebx, 2
add esi, eax
This subtracts 2 from ebx
(x
). Then eax
(return value from _fib
call is added to esi
, which was set to 0 before). Hence esi
now holds the result of fib(x-1)
.
cmp ebx, 1
jg L3
and edi, 1
ebx
is compared to 1. If it is greater than 1, then we loop back to L3
. Otherwise (the case where it is 0 or 1), we perform and edi, 1
and fall through to L2
(we analysed what this does earlier already). The and edi, 1
is equivalent to performing a %2
on x
.
From a high level, this is what the code does:
x < 2
, then return x
.fib(x-...)
until x
is smaller than 2. In this case, fall through to the x < 2
case.The optimization is that GCC unwinds the cases where x >= 2
by doing them in a loop instead of recursively.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With