
What's going on in Apple LLVM-gcc x86 assembly?

I'm interested in learning more x86/x86_64 assembly. Alas, I am on a Mac. No problem, right?

$ gcc --version
i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 
5658) (LLVM build 2336.11.00)
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO 
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I wrote a simple "Hello World" in C to get a baseline on what sort of code I'll have to write. I did a little x86 back in college, and have looked up numerous tutorials, but none of them look like the freakish output I'm seeing here:

.section    __TEXT,__text,regular,pure_instructions
.globl  _main
.align  4, 0x90
_main:
Leh_func_begin1:
pushq   %rbp
Ltmp0:
movq    %rsp, %rbp
Ltmp1:
subq    $32, %rsp
Ltmp2:
movl    %edi, %eax
movl    %eax, -4(%rbp)
movq    %rsi, -16(%rbp)
leaq    L_.str(%rip), %rax
movq    %rax, %rdi
callq   _puts
movl    $0, -24(%rbp)
movl    -24(%rbp), %eax
movl    %eax, -20(%rbp)
movl    -20(%rbp), %eax
addq    $32, %rsp
popq    %rbp
ret
Leh_func_end1:

.section    __TEXT,__cstring,cstring_literals
L_.str:
.asciz   "Hello, World!"

.section    __TEXT,__eh_frame,coalesced,no_toc+strip_static_syms+live_support
EH_frame0:
Lsection_eh_frame:
Leh_frame_common:
Lset0 = Leh_frame_common_end-Leh_frame_common_begin
.long   Lset0
Leh_frame_common_begin:
.long   0
.byte   1
.asciz   "zR"
.byte   1
.byte   120
.byte   16
.byte   1
.byte   16
.byte   12
.byte   7
.byte   8
.byte   144
.byte   1
.align  3
Leh_frame_common_end:
.globl  _main.eh
_main.eh:
Lset1 = Leh_frame_end1-Leh_frame_begin1
.long   Lset1
Leh_frame_begin1:
Lset2 = Leh_frame_begin1-Leh_frame_common
.long   Lset2
Ltmp3:
.quad   Leh_func_begin1-Ltmp3
Lset3 = Leh_func_end1-Leh_func_begin1
.quad   Lset3
.byte   0
.byte   4
Lset4 = Ltmp0-Leh_func_begin1
.long   Lset4
.byte   14
.byte   16
.byte   134
.byte   2
.byte   4
Lset5 = Ltmp1-Ltmp0
.long   Lset5
.byte   13
.byte   6
.align  3
Leh_frame_end1:


.subsections_via_symbols

Now...maybe things have changed a bit, but this isn't exactly friendly, even for assembly code. I'm having a hard time wrapping my head around this...Would someone help break down what is going on in this code and why it is all needed?

Many, many thanks in advance.

Mike Bell asked Mar 08 '13 11:03


1 Answer

Since the question is really about those odd labels and data and not really about the code itself, I'm only going to shed some light on them.

If an instruction of the program causes an execution error (such as division by 0, access to an inaccessible memory region, or an attempt to execute a privileged instruction), it results in an exception (not the C++ kind of exception but the interrupt kind) and forces the CPU to execute the appropriate exception handler in the OS kernel. If we were to disallow these exceptions entirely, the story would be very short: the OS would simply terminate the program.

However, there are advantages to letting programs handle their own exceptions, so the primary exception handler in the OS reflects some exceptions back into the program for handling. For example, a program could attempt to recover from the exception, or it could save a meaningful crash report before terminating.

In either case, it is useful to know the following:

  • the function in which the exception occurred, not just the offending instruction in it
  • the function that called that function, the function that called that one, and so on

and possibly (mainly for debugging):

  • the line of the source file from which this instruction was generated
  • the lines where these function calls were made
  • the function parameters

Why do we need to know the call tree?

Well, if the program registers its own exception handlers, it usually does so with something like C++ try and catch blocks:

#include <exception>

void fxn()
{
  try
  {
    // do something potentially harmful
  }
  catch (const std::exception&)
  {
    // catch and handle a standard library exception
  }
  catch (...)
  {
    // catch and handle anything else that was thrown
  }
}

If neither of those catch blocks catches the exception, it propagates to the caller of fxn, and potentially to the caller of the caller of fxn, until some catch catches it or until it reaches the default exception handler, which simply terminates the program.
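The propagation described above can be seen in a small C++ sketch. The names `fxn` and `caller` and the choice of exception types are only illustrative assumptions; the point is that an exception not matched by the inner catch leaves `fxn` and is handled one level up.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical callee: its try/catch only handles std::out_of_range,
// so any other exception leaves fxn() and travels up to the caller.
void fxn(bool out_of_range) {
    try {
        if (out_of_range) throw std::out_of_range("bad index");
        throw std::runtime_error("something else");
    } catch (const std::out_of_range&) {
        // handled locally; nothing propagates
    }
}

// Hypothetical caller: reports where the exception was finally caught.
std::string caller(bool out_of_range) {
    try {
        fxn(out_of_range);
        return "handled in fxn";
    } catch (const std::runtime_error& e) {
        return std::string("handled in caller: ") + e.what();
    }
}
```

Calling `caller(true)` never reaches the outer catch, while `caller(false)` does, which is exactly the case where the runtime has to find its way from `fxn` back into `caller`.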

So, you need to know the code regions that each try covers and you need to know how to get to the next closest try (in the caller of fxn, for example) if the immediate try/catch doesn't catch the exception and it has to bubble up.

The ranges for try and the locations of catch blocks are easy to encode in a special section of the executable, and they are easy to work with (just do a binary search for the offending instruction address in those ranges). But figuring out the next outer try block is harder, because you may need to find the return address of the function in which the exception occurred.
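As a rough illustration of that binary search, here is a sketch over a hypothetical table of try ranges. The `TryRange` layout and the assumption that ranges are sorted and non-overlapping are mine, not the real on-disk format (nested try blocks would need more care).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical table entry: instructions in [begin, end) are covered by a
// try block whose handler starts at `handler`.
struct TryRange {
    std::uint64_t begin;
    std::uint64_t end;
    std::uint64_t handler;
};

// Returns the entry covering `pc`, or nullptr if no try block covers it.
// Assumes `table` is sorted by `begin` and the ranges do not overlap.
const TryRange* find_try_range(const std::vector<TryRange>& table, std::uint64_t pc) {
    assert(std::is_sorted(table.begin(), table.end(),
                          [](const TryRange& a, const TryRange& b) { return a.begin < b.begin; }));
    // First entry whose begin is greater than pc; the candidate is the one before it.
    auto it = std::upper_bound(table.begin(), table.end(), pc,
                               [](std::uint64_t value, const TryRange& r) { return value < r.begin; });
    if (it == table.begin()) return nullptr;
    --it;
    return pc < it->end ? &*it : nullptr;
}
```

A null result means the faulting address is not inside any try block of this function, which is the situation where the search has to continue in the caller.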

And you cannot always rely on rbp+8 pointing to the return address on the stack, because the compiler may optimize the code so that rbp is no longer involved in accessing function parameters and local variables. You can access them through rsp+something as well, which saves a register and a few instructions. But different functions allocate different numbers of bytes on the stack for locals and for parameters passed to other functions, and they adjust rsp differently, so the value of rsp alone isn't enough to find the return address and the calling function. rsp can be an arbitrary number of bytes away from where the return address is on the stack.
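For contrast, this is what the easy case looks like when every function does keep rbp as a frame pointer. The sketch uses a toy memory model (an array of 8-byte words), not the real stack, and the convention that a saved rbp of 0 ends the chain is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model: memory is an array of 8-byte words and an "address" is an index
// into it multiplied by 8. None of this touches the real machine stack.
using Memory = std::vector<std::uint64_t>;

std::uint64_t load(const Memory& mem, std::uint64_t addr) {
    assert(addr % 8 == 0 && addr / 8 < mem.size());
    return mem[addr / 8];
}

// Follows the chain of saved rbp values. With the standard prologue
// (pushq %rbp; movq %rsp, %rbp), [rbp] holds the caller's rbp and
// [rbp+8] holds the return address. A saved rbp of 0 ends the chain here.
std::vector<std::uint64_t> return_addresses(const Memory& mem, std::uint64_t rbp) {
    std::vector<std::uint64_t> out;
    while (rbp != 0) {
        out.push_back(load(mem, rbp + 8));
        rbp = load(mem, rbp);
    }
    return out;
}
```

Once a function stops maintaining rbp this chain is broken, and nothing on the stack itself says how far rsp is from the return address.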

For such scenarios the compiler includes additional information about functions and their stack usage in a dedicated section of the executable. The exception-handling code examines this information and properly unwinds the stack when exceptions have to propagate to the calling functions and their try/catch blocks.
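A much simplified sketch of how such information could be used follows. The `FunctionInfo` record and its single `frame_size` field are hypothetical; real unwind data describes how the frame changes from instruction to instruction, and the memory here is the same kind of toy array of 8-byte words, not a real stack.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-function record. `frame_size` is how many bytes lie
// between rsp and the return address while the function body is running.
struct FunctionInfo {
    std::uint64_t start;
    std::uint64_t size;
    std::uint64_t frame_size;
};

struct Registers {
    std::uint64_t pc;
    std::uint64_t rsp;
};

// Toy memory: 8-byte words, address = index * 8.
using Memory = std::vector<std::uint64_t>;

// Undoes one call: finds the function containing pc, reads the return address
// at rsp + frame_size, and pops it. Returns false if pc is in no known function.
bool unwind_one_frame(const std::vector<FunctionInfo>& table, const Memory& mem, Registers& regs) {
    for (const FunctionInfo& f : table) {
        if (regs.pc >= f.start && regs.pc < f.start + f.size) {
            std::uint64_t slot = regs.rsp + f.frame_size;
            assert(slot % 8 == 0 && slot / 8 < mem.size());
            regs.pc = mem[slot / 8];   // continue in the caller
            regs.rsp = slot + 8;       // rsp as it was just before the call
            return true;
        }
    }
    return false;
}
```

Repeating `unwind_one_frame` walks up the call chain without ever consulting rbp, which is the property the extra section exists to provide.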

So, the data following _main.eh contains that additional information. Note that it explicitly encodes the beginning and the size of main() by referring to Leh_func_begin1 and Leh_func_end1-Leh_func_begin1. This piece of info allows the exception-handling code to identify main()'s instructions as main()'s.

It also appears that main() isn't unique in this respect: some of its stack/exception info is the same as in other functions, so it makes sense to share it between them. Hence the reference to Leh_frame_common.

I can't comment further on the structure of _main.eh or the exact meaning of constants like 144 and 13, as I don't know the format of this data. But generally one doesn't need to know these details unless one is a compiler or debugger developer.
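For readers who do want to look inside: the constants are consistent with DWARF call-frame instructions, the encoding generally used in __eh_frame sections. The sketch below decodes only the opcodes that occur in the listing and is not a full parser. It assumes the DWARF x86-64 register numbers (6 = rbp, 7 = rsp, 16 = return address) and single-byte operands. The 4-byte deltas written as Lset4 and Lset5 in the listing would have to be supplied as concrete values; 1 and 3 are plausible, given the usual encoded lengths of pushq %rbp and movq %rsp, %rbp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes the few DWARF call-frame instructions that occur in the listing.
// `data_align` is the data alignment factor from the common part (-8 here,
// stored as the byte 120).
std::vector<std::string> decode_cfa(const std::vector<std::uint8_t>& bytes, int data_align) {
    std::vector<std::string> out;
    std::size_t i = 0;
    // Reads one operand byte; every operand in the listing fits in a single byte.
    auto next = [&]() -> int {
        assert(i < bytes.size());
        return bytes[i++];
    };
    while (i < bytes.size()) {
        int op = next();
        if ((op & 0xC0) == 0x80) {          // DW_CFA_offset, register in low 6 bits
            int reg = op & 0x3F;
            int off = next() * data_align;
            out.push_back("r" + std::to_string(reg) + " saved at CFA" +
                          (off < 0 ? "" : "+") + std::to_string(off));
        } else if (op == 0x0C) {            // DW_CFA_def_cfa
            int reg = next();
            int off = next();
            out.push_back("CFA = r" + std::to_string(reg) + "+" + std::to_string(off));
        } else if (op == 0x0E) {            // DW_CFA_def_cfa_offset
            out.push_back("CFA offset = " + std::to_string(next()));
        } else if (op == 0x0D) {            // DW_CFA_def_cfa_register
            out.push_back("CFA register = r" + std::to_string(next()));
        } else if (op == 0x04) {            // DW_CFA_advance_loc4, little-endian delta
            std::uint32_t delta = 0;
            for (int k = 0; k < 4; ++k) delta |= static_cast<std::uint32_t>(next()) << (8 * k);
            out.push_back("advance " + std::to_string(delta) + " bytes");
        } else if (op == 0x00) {            // DW_CFA_nop (padding)
            out.push_back("nop");
        } else {
            out.push_back("unknown opcode " + std::to_string(op));
        }
    }
    return out;
}
```

Fed the bytes 12, 7, 8, 144, 1 from the common part, this reads as "the return address sits at rsp+8 on entry". The bytes after _main.eh then record that after pushq %rbp the offset grows to 16 with rbp saved below the return address, and that after movq %rsp, %rbp the frame is tracked through rbp instead of rsp.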

I hope this gives you an idea of what those labels and constants are for.

Alexey Frunze answered Sep 21 '22 03:09
