So I found out that C(++) programs actually don't compile to plain "binary" (I may have gotten some things wrong here, in that case I'm sorry :D) but to a range of things (symbol table, os-related stuff,...) but...
Does assembler "compile" to pure binary? That means no extra stuff besides resources like predefined strings, etc.
If C compiles to something else than plain binary, how can that small assembler bootloader just copy the instructions from the HDD to memory and execute them? I mean if the OS kernel, which is probably written in C, compiles to something different than plain binary - how does the bootloader handle it?
edit: I know that assembler doesn't "compile" because it only has your machine's instruction set - I didn't find a good word for what assembler "assembles" to. If you have one, leave it here as comment and I'll change it.
C is a compiled language. Its source code is written using any editor of a programmer's choice in the form of a text file, then it has to be compiled into machine code.
As with assembly language, a compiled language is translated directly into machine-readable binary code by a special program called a compiler. The result is a program file that can then be subsequently run without needing to refer to the human-readable source code.
c is being compiled without any additional links or files, so the program will be converted into an executable by itself. Alas, for the grand finale, let's run the full shebang (no computing reference intended here), gcc main.c without any options, to preprocess, compile, assemble, and link the program all at once.
Related: Does a compiler always produce an assembly code? - no, big mainstream C compilers that provide a complete toolchain often go straight to machine code, especially ones (unlike GCC) that only target a few ISAs / object file formats.
C typically compiles to assembler, just because that makes life easy for the poor compiler writer.
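Assembly code usually assembles (not "compiles") to relocatable object code, although some assemblers can also emit a flat binary with no container at all (NASM's -f bin output format, for instance), which is what a boot sector needs. You can think of relocatable object code as binary machine code and binary data, but with lots of decoration and metadata. The key parts are: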
Code and data appear in named "sections".
Relocatable object files may include definitions of labels, which refer to locations within the sections.
Relocatable object files may include "holes" that are to be filled with the values of labels defined elsewhere. The official name for such a hole is a relocation entry.
For example, if you compile and assemble (but don't link) this program
#include <stdio.h>
int main () { printf("Hello, world\n"); }
you are likely to wind up with a relocatable object file with
A text section containing the machine code for main
A label definition for main which points to the beginning of the text section
A rodata (read-only data) section containing the bytes of the string literal "Hello, world\n"
A relocation entry that depends on printf and that points to a "hole" in a call instruction in the middle of a text section.
If you are on a Unix system, a relocatable object file is generally called a .o file, as in hello.o, and you can explore the label definitions and uses with a simple tool called nm, and you can get more detailed information from a somewhat more complicated tool called objdump.
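To make the "hole" idea concrete, here is a rough sketch in C of what a linker does with a relocation entry. Everything in it is an assumption made for illustration: the toy_symbol and toy_reloc structures and the toy_apply_relocs function are hypothetical names, not part of any real object format, and the hole is filled with a 32-bit little-endian absolute address, whereas a real x86 call normally needs a PC-relative value and real formats such as ELF carry a relocation type to say which.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified structures; real object formats are richer. */
struct toy_symbol {
    const char *name;    /* label name, e.g. "printf" */
    uint32_t address;    /* final address the linker chose for it */
};

struct toy_reloc {
    size_t offset;       /* where the hole starts inside the section */
    const char *symbol;  /* label whose address fills the hole */
};

/* Look up a label. Returns 0 and stores the address in *out on success. */
int toy_lookup(const struct toy_symbol *syms, size_t nsyms,
               const char *name, uint32_t *out)
{
    for (size_t i = 0; i < nsyms; i++) {
        if (strcmp(syms[i].name, name) == 0) {
            *out = syms[i].address;
            return 0;
        }
    }
    return -1; /* undefined symbol */
}

/* Fill every hole with a 32-bit little-endian absolute address (a
   simplifying assumption). Returns 0 on success, or -1 if a symbol is
   undefined or a hole does not fit inside the section. */
int toy_apply_relocs(uint8_t *section, size_t size,
                     const struct toy_reloc *relocs, size_t nrelocs,
                     const struct toy_symbol *syms, size_t nsyms)
{
    for (size_t i = 0; i < nrelocs; i++) {
        size_t off = relocs[i].offset;
        uint32_t addr;

        if (off > size || size - off < 4)
            return -1;
        if (toy_lookup(syms, nsyms, relocs[i].symbol, &addr) != 0)
            return -1;

        section[off + 0] = (uint8_t)(addr & 0xff);
        section[off + 1] = (uint8_t)((addr >> 8) & 0xff);
        section[off + 2] = (uint8_t)((addr >> 16) & 0xff);
        section[off + 3] = (uint8_t)((addr >> 24) & 0xff);
    }
    return 0;
}
```

Given a text section whose call instruction has four zero bytes at offset 1 and a relocation entry {1, "printf"}, calling toy_apply_relocs with a symbol table that defines printf overwrites just those four bytes and leaves the rest of the section untouched. An undefined symbol makes it fail, which is roughly the situation behind a linker's "undefined reference" error.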
I teach a class that covers these topics, and I have students write an assembler and linker, which takes a couple of weeks, but when they've done that most of them have a pretty good handle on relocatable object code. It's not such an easy thing.
Let's take a C program.
When you run gcc
, clang
, or 'cl' on the c program, it will go through these stages:
In practice, some of these steps may be done at the same time, but this is the logical order. Most compilers have options to stop after any given step (e.g. preprocess or asm), including dumping internal representation between optimization passes for open-source compilers like GCC (-fdump-tree-...).
Note that there's a 'container' in ELF or COFF format around the actual executable binary, unless it's a DOS .com executable.
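One way to see the container is to look at the first few bytes of a file. The sketch below is only an illustration, with classify_image and the image_kind enum being hypothetical names: it checks for the ELF magic number and reads the e_type field that follows the 16-byte identification block. It assumes the standard ELF header layout and does not attempt to recognise COFF or any other format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
enum image_kind {
    IMAGE_NOT_ELF,   /* no ELF magic: a flat binary or some other format */
    IMAGE_ELF_REL,   /* relocatable object file (.o) */
    IMAGE_ELF_EXEC,  /* fixed-address executable */
    IMAGE_ELF_DYN,   /* shared object or position-independent executable */
    IMAGE_ELF_OTHER  /* core dump or an unrecognised type */
};

/* Classify a file image from its first bytes. */
enum image_kind classify_image(const uint8_t *buf, size_t len)
{
    uint16_t type;

    /* An ELF file starts with the bytes 0x7f 'E' 'L' 'F'. */
    if (len < 18 || buf[0] != 0x7f || buf[1] != 'E' ||
        buf[2] != 'L' || buf[3] != 'F')
        return IMAGE_NOT_ELF;

    /* Byte 5 (EI_DATA) gives the byte order: 1 = little, 2 = big.
       e_type is the 16-bit field at offset 16. */
    if (buf[5] == 2)
        type = (uint16_t)((buf[16] << 8) | buf[17]);
    else
        type = (uint16_t)(buf[16] | (buf[17] << 8));

    switch (type) {
    case 1:  return IMAGE_ELF_REL;
    case 2:  return IMAGE_ELF_EXEC;
    case 3:  return IMAGE_ELF_DYN;
    default: return IMAGE_ELF_OTHER;
    }
}
```

A flat boot sector or a DOS .com file has no such header, so it would come back as IMAGE_NOT_ELF: its first byte is simply its first instruction, which is why a tiny bootloader can copy it to memory and jump to it. A kernel stored as ELF needs either a bootloader that understands these headers or a conversion step that strips the container first.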
You will find that a book on compilers (I recommend the Dragon book, the standard introductory book in the field) will have all the information you need and more.
As Marco commented, linking and loading is a large area and the Dragon book more or less stops at the output of the executable binary. To actually go from there to running on an operating system is a decently complex process, which Levine in Linkers and Loaders covers.
I've wiki'd this answer to let people tweak any errors/add information.