I am trying to understand the assembly level code for a simple C program by inspecting it with gdb's disassembler.
Following is the C code:
#include <stdio.h>
void function(int a, int b, int c) {
char buffer1[5];
char buffer2[10];
}
void main() {
function(1,2,3);
}
Following is the disassembly code for both main
and function
gdb) disass main
Dump of assembler code for function main:
0x08048428 <main+0>: push %ebp
0x08048429 <main+1>: mov %esp,%ebp
0x0804842b <main+3>: and $0xfffffff0,%esp
0x0804842e <main+6>: sub $0x10,%esp
0x08048431 <main+9>: movl $0x3,0x8(%esp)
0x08048439 <main+17>: movl $0x2,0x4(%esp)
0x08048441 <main+25>: movl $0x1,(%esp)
0x08048448 <main+32>: call 0x8048404 <function>
0x0804844d <main+37>: leave
0x0804844e <main+38>: ret
End of assembler dump.
(gdb) disass function
Dump of assembler code for function function:
0x08048404 <function+0>: push %ebp
0x08048405 <function+1>: mov %esp,%ebp
0x08048407 <function+3>: sub $0x28,%esp
0x0804840a <function+6>: mov %gs:0x14,%eax
0x08048410 <function+12>: mov %eax,-0xc(%ebp)
0x08048413 <function+15>: xor %eax,%eax
0x08048415 <function+17>: mov -0xc(%ebp),%eax
0x08048418 <function+20>: xor %gs:0x14,%eax
0x0804841f <function+27>: je 0x8048426 <function+34>
0x08048421 <function+29>: call 0x8048340 <__stack_chk_fail@plt>
0x08048426 <function+34>: leave
0x08048427 <function+35>: ret
End of assembler dump.
I am seeking answers for following things :
The gcc compiler generates assembly language code that is then translated to machine language by the assembler as . We can also use as to translate hand-written or, in our case, compiler-written text files containing assembly language code.
An assembly language is a low-level programming language designed for a specific type of processor. It may be produced by compiling source code from a high-level programming language (such as C/C++) but can also be written from scratch. Assembly code can be converted to machine code using an assembler.
Assembly language is low-level code that relies on a strong relationship between the instructions input using the coding language and how a machine interprets the code instructions. Code is converted into executable actions using an assembler that converts input into recognizable instructions for the machine.
You can't deterministically convert assembly code to C. Interrupts, self modifying code, and other low level things have no representation other than inline assembly in C. There is only some extent to which an assembly to C process can work.
The reason for the "strange" addresses such as main+0
, main+1
, main+3
, main+6
and so on, is because each instruction takes up a variable number of bytes. For example:
main+0: push %ebp
is a one-byte instruction so the next instruction is at main+1
. On the other hand,
main+3: and $0xfffffff0,%esp
is a three-byte instruction so the next instruction after that is at main+6
.
And, since you ask in the comments why movl
seems to take a variable number of bytes, the explanation for that is as follows.
Instruction length depends not only on the opcode (such as movl
) but also the addressing modes for the operands as well (the things the opcode are operating on). I haven't checked specifically for your code but I suspect the
movl $0x1,(%esp)
instruction is probably shorter because there's no offset involved - it just uses esp
as the address. Whereas something like:
movl $0x2,0x4(%esp)
requires everything that movl $0x1,(%esp)
does, plus an extra byte for the offset 0x4
.
In fact, here's a debug session showing what I mean:
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
c:\pax> debug
-a
0B52:0100 mov word ptr [di],7
0B52:0104 mov word ptr [di+2],8
0B52:0109 mov word ptr [di+0],7
0B52:010E
-u100,10d
0B52:0100 C7050700 MOV WORD PTR [DI],0007
0B52:0104 C745020800 MOV WORD PTR [DI+02],0008
0B52:0109 C745000700 MOV WORD PTR [DI+00],0007
-q
c:\pax> _
You can see that the second instruction with an offset is actually different to the first one without it. It's one byte longer (5 bytes instead of 4, to hold the offset) and actually has a different encoding c745
instead of c705
.
You can also see that you can encode the first and third instruction in two different ways but they basically do the same thing.
The and $0xfffffff0,%esp
instruction is a way to force esp
to be on a specific boundary. This is used to ensure proper alignment of variables. Many memory accesses on modern processors will be more efficient if they follow the alignment rules (such as a 4-byte value having to be aligned to a 4-byte boundary). Some modern processors will even raise a fault if you don't follow these rules.
After this instruction, you're guaranteed that esp
is both less than or equal to its previous value and aligned to a 16 byte boundary.
The gs:
prefix simply means to use the gs
segment register to access memory rather than the default.
The instruction mov %eax,-0xc(%ebp)
means to take the contents of the ebp
register, subtract 12 (0xc
) and then put the value of eax
into that memory location.
Re the explanation of the code. Your function
function is basically one big no-op. The assembly generated is limited to stack frame setup and teardown, along with some stack frame corruption checking which uses the afore-mentioned %gs:14
memory location.
It loads the value from that location (probably something like 0xdeadbeef
) into the stack frame, does its job, then checks the stack to ensure it hasn't been corrupted.
Its job, in this case, is nothing. So all you see is the function administration stuff.
Stack set-up occurs between function+0
and function+12
. Everything after that is setting up the return code in eax
and tearing down the stack frame, including the corruption check.
Similarly, main
consist of stack frame set-up, pushing the parameters for function
, calling function
, tearing down the stack frame and exiting.
Comments have been inserted into the code below:
0x08048428 <main+0>: push %ebp ; save previous value.
0x08048429 <main+1>: mov %esp,%ebp ; create new stack frame.
0x0804842b <main+3>: and $0xfffffff0,%esp ; align to boundary.
0x0804842e <main+6>: sub $0x10,%esp ; make space on stack.
0x08048431 <main+9>: movl $0x3,0x8(%esp) ; push values for function.
0x08048439 <main+17>: movl $0x2,0x4(%esp)
0x08048441 <main+25>: movl $0x1,(%esp)
0x08048448 <main+32>: call 0x8048404 <function> ; and call it.
0x0804844d <main+37>: leave ; tear down frame.
0x0804844e <main+38>: ret ; and exit.
0x08048404 <func+0>: push %ebp ; save previous value.
0x08048405 <func+1>: mov %esp,%ebp ; create new stack frame.
0x08048407 <func+3>: sub $0x28,%esp ; make space on stack.
0x0804840a <func+6>: mov %gs:0x14,%eax ; get sentinel value.
0x08048410 <func+12>: mov %eax,-0xc(%ebp) ; put on stack.
0x08048413 <func+15>: xor %eax,%eax ; set return code 0.
0x08048415 <func+17>: mov -0xc(%ebp),%eax ; get sentinel from stack.
0x08048418 <func+20>: xor %gs:0x14,%eax ; compare with actual.
0x0804841f <func+27>: je <func+34> ; jump if okay.
0x08048421 <func+29>: call <_stk_chk_fl> ; otherwise corrupted stack.
0x08048426 <func+34>: leave ; tear down frame.
0x08048427 <func+35>: ret ; and exit.
I think the reason for the %gs:0x14
may be evident from above but, just in case, I'll elaborate here.
It uses this value (a sentinel) to put in the current stack frame so that, should something in the function do something silly like write 1024 bytes to a 20-byte array created on the stack or, in your case:
char buffer1[5];
strcpy (buffer1, "Hello there, my name is Pax.");
then the sentinel will be overwritten and the check at the end of the function will detect that, calling the failure function to let you know, and then probably aborting so as to avoid any other problems.
If it placed 0xdeadbeef
onto the stack and this was changed to something else, then an xor
with 0xdeadbeef
would produce a non-zero value which is detected in the code with the je
instruction.
The relevant bit is paraphrased here:
mov %gs:0x14,%eax ; get sentinel value.
mov %eax,-0xc(%ebp) ; put on stack.
;; Weave your function
;; magic here.
mov -0xc(%ebp),%eax ; get sentinel back from stack.
xor %gs:0x14,%eax ; compare with original value.
je stack_ok ; zero/equal means no corruption.
call stack_bad ; otherwise corrupted stack.
stack_ok: leave ; tear down frame.
Pax has produced a definitive answer. However, for completeness, I thought I'd add a note on getting GCC itself to show you the assembly it generates.
The -S
option to GCC tells it to stop compilation and write the assembly to a file. Normally, it either passes that file to the assembler or for some targets writes the object file directly itself.
For the sample code in the question:
#include <stdio.h>
void function(int a, int b, int c) {
char buffer1[5];
char buffer2[10];
}
void main() {
function(1,2,3);
}
the command gcc -S q3654898.c
creates a file named q3654898.s:
.file "q3654898.c" .text .globl _function .def _function; .scl 2; .type 32; .endef _function: pushl %ebp movl %esp, %ebp subl $40, %esp leave ret .def ___main; .scl 2; .type 32; .endef .globl _main .def _main; .scl 2; .type 32; .endef _main: pushl %ebp movl %esp, %ebp subl $24, %esp andl $-16, %esp movl $0, %eax addl $15, %eax addl $15, %eax shrl $4, %eax sall $4, %eax movl %eax, -4(%ebp) movl -4(%ebp), %eax call __alloca call ___main movl $3, 8(%esp) movl $2, 4(%esp) movl $1, (%esp) call _function leave ret
One thing that is evident is that my GCC (gcc (GCC) 3.4.5 (mingw-vista special r3)) doesn't include the stack check code by default. I imagine that there is a command line option, or that if I ever got around to nudging my MinGW install up to a more current GCC that it could.
Edit: Nudged to do so by Pax, here's another way to get GCC to do more of the work.
C:\Documents and Settings\Ross\My Documents\testing>gcc -Wa,-al q3654898.c q3654898.c: In function `main': q3654898.c:8: warning: return type of 'main' is not `int' GAS LISTING C:\DOCUME~1\Ross\LOCALS~1\Temp/ccLg8pWC.s page 1 1 .file "q3654898.c" 2 .text 3 .globl _function 4 .def _function; .scl 2; .type 32; .endef 5 _function: 6 0000 55 pushl %ebp 7 0001 89E5 movl %esp, %ebp 8 0003 83EC28 subl $40, %esp 9 0006 C9 leave 10 0007 C3 ret 11 .def ___main; .scl 2; .type 32; .endef 12 .globl _main 13 .def _main; .scl 2; .type 32; .endef 14 _main: 15 0008 55 pushl %ebp 16 0009 89E5 movl %esp, %ebp 17 000b 83EC18 subl $24, %esp 18 000e 83E4F0 andl $-16, %esp 19 0011 B8000000 movl $0, %eax 19 00 20 0016 83C00F addl $15, %eax 21 0019 83C00F addl $15, %eax 22 001c C1E804 shrl $4, %eax 23 001f C1E004 sall $4, %eax 24 0022 8945FC movl %eax, -4(%ebp) 25 0025 8B45FC movl -4(%ebp), %eax 26 0028 E8000000 call __alloca 26 00 27 002d E8000000 call ___main 27 00 28 0032 C7442408 movl $3, 8(%esp) 28 03000000 29 003a C7442404 movl $2, 4(%esp) 29 02000000 30 0042 C7042401 movl $1, (%esp) 30 000000 31 0049 E8B2FFFF call _function 31 FF 32 004e C9 leave 33 004f C3 ret C:\Documents and Settings\Ross\My Documents\testing>
Here we see an output listing produced by the assembler. (Its name is GAS
, because it is Gnu's version of the classic *nix assembler as
. There's humor there somewhere.)
Each line has most of the following fields: a line number, an address in the current section, bytes stored at that address, and the source text from the assembly source file.
The addresses are offsets into that portion of each section provided by this module. This particular module only has content in the .text
section which stores executable code. You will typically find mention of sections named .data
and .bss
as well. Lots of other names are used and some have special purposes. Read the manual for the linker if you really want to know.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With