A compiler takes the program code (source code) and converts the source code to a machine language module (called an object file). Another specialized program, called a linker, combines this object file with other previously compiled object files (in particular run-time modules) to create an executable file.
In order to convert assembly language into machine code it needs to be translated using an assembler . This converts each statement into the specific machine code needed for the hardware on which it is being run.
An assembler converts assembly language into machine language. A disassembler converts machine language into assembly.
gcc actually produces assembler and assembles it using the as assembler. Not all compilers do this - the MS compilers produce object code directly, though you can make them generate assembler output. Translating assembler to object code is a pretty simple process, at least compared with compilation.
Some compilers produce other high-level language code as their output - for example, cfront, the first C++ compiler produced C as its output which was then compiled by a C compiler.
Note that neither direct compilation or assembly actually produce an executable. That is done by the linker, which takes the various object code files produced by compilation/assembly, resolves all the names they contain and produces the final executable binary.
Almost all compilers, including gcc, produce assembly code because it's easier---both to produce and to debug the compiler. The major exceptions are usually just-in-time compilers or interactive compilers, whose authors don't want the performance overhead or the hassle of forking a whole process to run the assembler. Some interesting examples include
Standard ML of New Jersey, which runs interactively and compiles every expression on the fly.
The tinycc compiler, which is designed to be fast enough to compile, load, and run a C script in well under 100 milliseconds, and therefore doesn't want the overhead of calling the assembler and linker.
What these cases have in common is a desire for "instantaneous" response. Assemblers and linkers are plenty fast, but not quite good enough for interactive response. Yet.
There are also a large family of languages, such as Smalltalk, Java, and Lua, which compile to bytecode, not assembly code, but whose implementations may later translate that bytecode directly to machine code without benefit of an assembler.
(Footnote: in the early 1990s, Mary Fernandez and I wrote the New Jersey Machine Code Toolkit, for which the code is online, which generates C libraries that compiler writers can use to bypass the standard assembler and linker. Mary used it to roughly double the speed of her optimizing linker when generating a.out
. If you don't write to disk, speedups are even greater...)
According to chapter 2 of Introduction to Reverse Engineering Software (by Mike Perry and Nasko Oskov), both gcc and cl.exe (the back end compiler for MSVC++) have the -S switch you can use to output the assembly that each compiler produces.
You can also run gcc in verbose mode (gcc -v
) to get a list of commands that it executes to see what it's doing behind the scenes.
Compilers, in general, parse the source code into an Abstract Syntax Tree (an AST), then into some intermediate language. Only then, usually after some optimizations, they emit the target language.
About gcc, it can compile to a wide variety of targets. I don't know if for x86 it compiles to assembly first, but I did give you some insight onto compilers - and you asked for that too.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With