Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What do C and Assembler actually compile to? [closed]

So I found out that C(++) programs actually don't compile to plain "binary" (I may have gotten some things wrong here, in that case I'm sorry :D) but to a range of things (symbol table, os-related stuff,...) but...

  • Does assembler "compile" to pure binary? That means no extra stuff besides resources like predefined strings, etc.

  • If C compiles to something else than plain binary, how can that small assembler bootloader just copy the instructions from the HDD to memory and execute them? I mean if the OS kernel, which is probably written in C, compiles to something different than plain binary - how does the bootloader handle it?

edit: I know that assembler doesn't "compile" because it only has your machine's instruction set - I didn't find a good word for what assembler "assembles" to. If you have one, leave it here as comment and I'll change it.

like image 978
lamas Avatar asked Jan 25 '10 21:01

lamas


People also ask

What does C code compile to?

C is a compiled language. Its source code is written using any editor of a programmer's choice in the form of a text file, then it has to be compiled into machine code.

What does assembly language compile to?

Computer Languages As with assembly language, a compiled language is translated directly into machine-readable binary code by a special program called a compiler. The result is a program file that can then be subsequently run without needing to refer to the human-readable source code.

Does C compile down to assembly?

c is being compiled without any additional links or files, so the program will be converted into an executable by itself. Alas, for the grand finale, let's run the full shebang (no computing reference intended here), gcc main. c without any options, to preprocess, compile, assemble, and link the program all at once.

Does C compile to assembly or machine code?

Related: Does a compiler always produce an assembly code? - no, big mainstream C compilers that provide a complete toolchain often go straight to machine code, especially ones (unlike GCC) that only target a few ISAs / object file formats.


2 Answers

C typically compiles to assembler, just because that makes life easy for the poor compiler writer.

Assembly code always assembles (not "compiles") to relocatable object code. You can think of this as binary machine code and binary data, but with lots of decoration and metadata. The key parts are:

  • Code and data appear in named "sections".

  • Relocatable object files may include definitions of labels, which refer to locations within the sections.

  • Relocatable object files may include "holes" that are to be filled with the values of labels defined elsewhere. The official name for such a hole is a relocation entry.

For example, if you compile and assemble (but don't link) this program

int main () { printf("Hello, world\n"); } 

you are likely to wind up with a relocatable object file with

  • A text section containing the machine code for main

  • A label definition for main which points to the beginning of the text section

  • A rodata (read-only data) section containing the bytes of the string literal "Hello, world\n"

  • A relocation entry that depends on printf and that points to a "hole" in a call instruction in the middle of a text section.

If you are on a Unix system a relocatable object file is generally called a .o file, as in hello.o, and you can explore the label definitions and uses with a simple tool called nm, and you can get more detailed information from a somewhat more complicated tool called objdump.

I teach a class that covers these topics, and I have students write an assembler and linker, which takes a couple of weeks, but when they've done that most of them have a pretty good handle on relocatable object code. It's not such an easy thing.

like image 96
Norman Ramsey Avatar answered Sep 23 '22 23:09

Norman Ramsey


Let's take a C program.

When you run gcc, clang, or 'cl' on the c program, it will go through these stages:

  1. Preprocessor (#include, #ifdef, trigraph analysis, encoding translations, comment management, macros...) including lexing into preprocessor tokens and eventually resulting in flat text for input to the compiler proper.
  2. Lexical analysis (producing tokens and lexical errors).
  3. Syntactical analysis (producing a parse tree and syntactical errors).
  4. Semantic analysis (producing a symbol table, scoping information and scoping/typing errors) Also data-flow, transforming the program logic into an "intermediate representation" that the optimizer can work with. (Often an SSA). clang/LLVM uses LLVM-IR, gcc uses GIMPLE then RTL.
  5. Optimization of the program logic, including constant propagation, inlining, hoisting invariants out of loops, auto-vectorization, and many many other things. (Most of the code for a widely-used modern compiler is optimization passes.) Transforming through intermediate representations is just part of how some compilers work, making it impossible / meaningless to "disable all optimizations"
  6. Outputing into assembly source (or another intermediate format like .NET IL bytecode)
  7. Assembling of the assembly into some binary object format.
  8. Linking of the assembly into whatever static libraries are needed, as well as relocating it if needed.
  9. Output of final executable in elf, PE/coff, MachO64, or whatever other format

In practice, some of these steps may be done at the same time, but this is the logical order. Most compilers have options to stop after any given step (e.g. preprocess or asm), including dumping internal representation between optimization passes for open-source compilers like GCC. (-ftree-dump-...)

Note that there's a 'container' of elf or coff format around the actual executable binary, unless it's a DOS .com executable

You will find that a book on compilers(I recommend the Dragon book, the standard introductory book in the field) will have all the information you need and more.

As Marco commented, linking and loading is a large area and the Dragon book more or less stops at the output of the executable binary. To actually go from there to running on an operating system is a decently complex process, which Levine in Linkers and Loaders covers.

I've wiki'd this answer to let people tweak any errors/add information.

like image 45
5 revs, 3 users 71% Avatar answered Sep 25 '22 23:09

5 revs, 3 users 71%