I am trying to learn how assembly works at an elementary level, so I have been playing with the -S output of gcc compilations. I wrote a simple program that defines two bytes and returns their sum. The entire program follows:
int main(void) {
    char A = 5;
    char B = 10;
    return A + B;
}
When I compile this with no optimizations using:
gcc -O0 -S -c test.c
I get test.s that looks like the following:
.file "test.c"
.def ___main; .scl 2; .type 32; .endef
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB0:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
andl $-16, %esp
subl $16, %esp
call ___main
movb $5, 15(%esp)
movb $10, 14(%esp)
movsbl 15(%esp), %edx
movsbl 14(%esp), %eax
addl %edx, %eax
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
LFE0:
.ident "GCC: (GNU) 4.9.2"
Now, recognizing that this program can very easily be simplified to just return a constant (15) I have been able to reduce the assembly by hand to perform the same function using this code:
.global _main
_main:
movl $15, %eax
ret
This appears to me to be the least amount of code possible (but I realize could be quite wrong) to perform this admittedly trivial task. Is this form the most "optimized" version of my C program?
Why is the initial output of GCC so much more verbose? What do the lines spanning from .cfi_startproc to call __main even do? What does call __main do? I cannot figure what the two subtraction operations are for.
Even with optimizations in GCC set to -O3 I get this:
.file "test.c"
.def ___main; .scl 2; .type 32; .endef
.section .text.unlikely,"x"
LCOLDB0:
.section .text.startup,"x"
LHOTB0:
.p2align 4,,15
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB0:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
andl $-16, %esp
call ___main
movl $15, %eax
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
LFE0:
.section .text.unlikely,"x"
LCOLDE0:
.section .text.startup,"x"
LHOTE0:
.ident "GCC: (GNU) 4.9.2"
Which seems to have removed a number of operations, but still leaves all the lines leading to call __main that seem unnecessary. What are all the .cfi_XXX lines for? Why are so many labels added? What do .section, .ident, .def, .p2align, etc. do?
I understand that many of the labels and symbols are included for debugging, but shouldn't these be stripped or omitted if I am not compiling with -g enabled?
UPDATE
To clarify, by saying
This appears to me to be the least amount of code possible (but I realize could be quite wrong) to perform this admittedly trivial task. Is this form the most "optimized" version of my C program?
I am not suggesting that I am trying to, or have achieved, an optimized version of this program. I realize the program is useless and trivial. I am just using it as a tool to learn assembly and how the compiler works.
The core of why I added this bit is to illustrate why I am confused that the 4 line version of this assembly code can effectively achieve the same effect as the others. It seems to me that GCC has added a lot of "stuff" whose purpose I cannot discern.
Thank you, Kin3TiX, for asking an asm-newbie question that wasn't just a code-dump of some nasty code with no comments, and a really simple problem. :)
As a way to get your feet wet with ASM, I'd suggest working with functions OTHER than main, e.g. just a function that takes two integer args and adds them. Then the compiler can't optimize it away. You can still call it with constants as args, and if it's in a different file from main, it won't get inlined, so you can even single-step through it.
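A minimal sketch of that suggestion follows. The file name add.c and the function name add_two are made up for illustration; nothing here comes from the question itself.

```c
/* add.c: a hypothetical stand-alone function to inspect with gcc -S.
 * It has external linkage and its arguments are only known at run
 * time, so the compiler must emit a real add for it. */
int add_two(int a, int b)
{
    return a + b;
}
```

Compiling this file on its own with something like gcc -O2 -S add.c should give a body of only a few instructions, without the startup work that main carries.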
There's some benefit to understanding what's going on at the asm level when you compile main, but other than embedded systems, you're only ever going to write optimized inner loops in asm. IMO, there's little point using asm if you aren't going to optimize the hell out of it. Otherwise you probably won't beat compiler output from source which is much easier to read.
Other tips for understanding compiler output: compile with gcc -S -fno-stack-check -fverbose-asm. The comments after each instruction are often nice reminders of what that load was for. Pretty soon it degenerates into a mess of temporaries with names like D.2983, but something like
movq 8(%rdi), %rcx # a_1(D)->elements, a_1(D)->elements
will save you a round-trip to the ABI reference to see which function arg comes in in %rdi, and which struct member is at offset 8.
See also How to remove "noise" from GCC/clang assembly output?
What do the lines spanning from .cfi_startproc to call __main even do?
_main:
LFB0:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
The .cfi directives emit stack-unwind info that debuggers (and C++ exception handling) use to walk back up the call stack. They are assembler directives, not instructions, so they won't be there if you look at objdump -d output instead of gcc -S output. You can also stop gcc from emitting them with -fno-asynchronous-unwind-tables.
The stuff with pushing %ebp and then setting it to the value of the stack pointer on function entry sets up what's called a "stack frame". This is why %ebp is called the base pointer. These insns won't be there if you compile with -fomit-frame-pointer, which gives code an extra register to work with. That's on by default at -O2. (This is huge for 32bit x86, since that takes you from 6 to 7 usable regs. %esp is still tied up being the stack pointer; stashing it temporarily in an xmm or mmx reg and then using it as another GP reg is possible in theory, but compilers will never do that, and it makes async stuff like POSIX signals or Windows SEH unusable, as well as making debugging harder.)
The leave instruction before the ret is also part of this stack frame stuff. It undoes the prologue: it is equivalent to movl %ebp, %esp followed by popl %ebp.
Frame pointers are mostly historical baggage, but do make offsets into the stack frame consistent. With debug symbols, you can backtrace the call stack just fine even with -fomit-frame-pointer, and it's the default for amd64. (The amd64 ABI has alignment requirements for the stack, and is a LOT better in other ways, too, e.g. it passes args in regs instead of on the stack.)
andl $-16, %esp
subl $16, %esp
The and aligns the stack to a 16-byte boundary, regardless of what it was before. The sub reserves 16 bytes on the stack for this function. (Notice how it's missing from the optimized version, because it optimizes away any need for memory storage of any variables.)
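To see the arithmetic behind those two instructions, here is a small C sketch. The function names are invented for illustration, and the stack pointer is modelled as a plain 32-bit integer, which is an assumption for the sake of the example.

```c
#include <stdint.h>

/* What `andl $-16, %esp` computes: -16 is 0xFFFFFFF0 in 32-bit two's
 * complement, so the AND clears the low four bits and rounds the
 * value down to a multiple of 16. */
uint32_t align_down_16(uint32_t sp)
{
    return sp & (uint32_t)-16;
}

/* What `subl $16, %esp` then does: the stack grows downward, so
 * subtracting reserves 16 bytes below the aligned address. */
uint32_t reserve_16(uint32_t sp)
{
    return sp - 16u;
}
```

With a made-up stack pointer of 0x0028FF3C, the and gives 0x0028FF30 and the sub then gives 0x0028FF20. The two char variables at 15(%esp) and 14(%esp) live in the top two bytes of that 16-byte block.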
call ___main
__main (asm name = ___main) is part of cygwin: it calls constructor / init functions for shared libraries (including libc). On GNU/Linux, this is handled by _start (before main is reached) and even dynamic-linker hooks that let libc initialize itself before the executable's own _start is even reached. I've read that dynamic-linker hooks (or _start from a static executable) instead of code in main would be possible under Cygwin, but they simply choose not to do it that way.
(This old mailing list message indicates __main is for constructors, but that main shouldn't have to call it on platforms that support getting the startup code to call it.)
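For a concrete idea of the kind of function that startup code runs, here is a small sketch assuming GCC or Clang. The names my_init and constructor_has_run are made up for illustration. Going by the description above, on a Cygwin-style target the call to __main at the top of main is where something like my_init would get invoked; on GNU/Linux the startup code does it before main is entered.

```c
/* Sketch using the GCC/Clang `constructor` attribute (a compiler
 * extension, not standard C). Functions marked this way are run by
 * the startup machinery before the body of main() executes. */
static int init_ran = 0;

__attribute__((constructor))
static void my_init(void)
{
    init_ran = 1;
}

/* Reports whether the constructor has already run. */
int constructor_has_run(void)
{
    return init_ran;
}
```

Calling constructor_has_run() from main should return 1, even though nothing in main called my_init explicitly.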
movb $5, 15(%esp)
movb $10, 14(%esp)
movsbl 15(%esp), %edx
movsbl 14(%esp), %eax
addl %edx, %eax
leave
ret
Why is the initial output of GCC so much more verbose?
Without optimizations enabled, gcc maps C statements as literally as possible into asm. Doing anything else would take more compile time. Thus, the two movb instructions come from the initializers for your two variables. The return value is computed by doing two loads with sign extension (movsbl), because C promotes both char operands to int BEFORE the add. The addition has to happen at int width to match the semantics of the C code as written: a sum that would not fit in a char must not wrap.
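The promotion rule is easy to check from C. This sketch uses signed char explicitly (plain char is signed with gcc on x86, but that is implementation-defined), and the name add_chars is made up for illustration.

```c
/* Mirrors `return A + B;` from the question. Both operands are
 * promoted to int before the addition (the job movsbl does in the
 * asm), so the sum is computed at int width, not char width. */
int add_chars(signed char a, signed char b)
{
    return a + b;
}
```

add_chars(100, 100) gives 200, a value that does not fit in a signed char, which is why the compiler cannot simply do an 8-bit add.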
I cannot figure what the two subtraction operations are for.
There is only one sub instruction. It reserves space on the stack for the function's variables, before the call to __main. Which other sub are you talking about?
What do .section, .ident, .def .p2align, etc. etc. do?
See the manual for the GNU assembler. Also available locally as info pages: run info gas.
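.ident: gcc putting its stamp on the object file, so you can tell what compiler produced it. Not relevant, ignore it.
.def (together with .scl, .type and .endef): COFF symbol information. Here it records the storage class and type of the symbols ___main and _main, for example that _main is an external function. It is object-file bookkeeping rather than code, so you can ignore it too.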
.section
: determines what section of the ELF object file the bytes from all following instructions or data directives (e.g. .byte 0x00
) go into, until the next .section
assembler directive. Either code
(read-only, shareable), data
(initialized read/write data, private), or bss
(block storage segment. zero-initialized, doesn't take any space in the object file).
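As a rough guide to where C objects usually end up, here is a sketch with made-up names. The placements in the comments are what gcc typically does, not a guarantee; the exact section names vary by target, and checking the gcc -S output for your own build is the way to be sure.

```c
/* Typical placement with gcc; exact section names vary by target. */
int counter = 42;              /* initialized, writable: usually .data */
int scratch[1024] = {0};       /* zero-initialized: usually .bss */
const char greeting[] = "hi";  /* read-only data: often .rodata or .rdata */

/* The machine code for a function goes into a code section (.text). */
int bump(void)
{
    return ++counter;
}
```

In the assembly output, each of these definitions should be preceded by a .section, .data, .bss or .text directive that switches the assembler to the matching section.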
.p2align: Power of 2 Align. Pad with nop instructions until the desired alignment. .align 16 is the same as .p2align 4. Jump instructions are faster when the target is aligned, because instructions are fetched in chunks of 16B, and because an aligned target avoids crossing a page boundary or a cache-line boundary. (32B alignment is relevant when code is already in the uop cache of an Intel Sandybridge and later.) See Agner Fog's docs, for example.
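The .p2align 4,,15 in your -O3 output has a third operand, which the GAS manual describes as the maximum number of bytes to skip. A sketch of the padding calculation, with a made-up function name and made-up example addresses:

```c
#include <stdint.h>

/* Padding bytes the assembler would insert for `.p2align p2,,max_skip`
 * at location `addr` (p2 must be below 32 here). Per the GAS manual,
 * if reaching the boundary needs more than max_skip bytes, no
 * alignment is done at all. */
uint32_t p2align_padding(uint32_t addr, unsigned p2, uint32_t max_skip)
{
    uint32_t mask = ((uint32_t)1 << p2) - 1u;  /* 15 for p2 == 4 */
    uint32_t pad = (0u - addr) & mask;         /* bytes to next boundary */
    return pad > max_skip ? 0u : pad;
}
```

For example, at address 0x1003 a .p2align 4,,15 needs 13 bytes of padding to reach 0x1010, while a .p2align 4,,7 at the same address would insert nothing because 13 exceeds the limit of 7.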
The core of why I added this bit is to illustrate why I am confused that the 4 line version of this assembly code can effectively achieve the same effect as the others. It seems to me that GCC has added a lot of "stuff" whose purpose I cannot discern.
Put the code of interest in a function by itself. A lot of things are special about main.
You are correct that a mov-immediate and a ret are all that's needed to implement the function, but gcc apparently doesn't have shortcuts for recognizing trivial whole-programs and omitting main's stack frame or the call to __main. >.<
Good question, though. As I said, just ignore all that crap and worry about just the small part you want to optimize.