int main(){\
int a = 5;\
return a;\
}
Above compiles fine. I assume the C preprocessor removes the backslashes before compilation?
output of gcc -E:
int main(){
int a = 5;
return a;}
It seems that not all of the \n (newline) characters get removed the way they are with macros; it mainly just removed the backslashes.
I have seen this used in multiline macros such as:
#define TEST(in)\
int a = in; \
int b = 6;
int main(){
TEST(5)
return 0;
}
output of gcc -E:
int main(){
int a = 5; int b = 6;
return 0;
}
The preprocessor removes the backslash as well as the \n character in the above example, so why does it not remove all the newline characters in my first example?
"Splices" -- backslash newline sequences -- are removed before the preprocessor processes the program text. At least that's the theory, bearing in mind that the C standard does not actually define a process called the "preprocessor".
What it does define is a procedure for converting the program text into a stream of tokens which can be parsed, and then turning that into an executable. The procedure consists of eight translation phases, and the compiler must produce the same result as would be produced if the phases were executed one at a time, each one taking as input the output of the previous phase. (Most of the inputs and outputs are streams of tokens, rather than character strings. So the output GCC produces when run with the -E flag doesn't correspond to anything in the standard, allowing GCC to basically produce whatever output it finds convenient. Or that its authors thought you would find convenient.)
The "as if" clause means that a particular compiler can combine phases or execute them in pieces, as long as it doesn't change the result. So you can really only look at the process as the abstract description of an algorithm. Still, it's useful to understand. The full text is found in §5.1.1.2 of the standard.
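A small sketch of the point about splices being removed before any directive is processed. The names ANSWER, splice_demo and check_splice_demo are made up for illustration; TEST is the macro from the question, with parentheses added around its argument. Because the splice is deleted first, even the directive name can be split across two physical lines, provided the continuation line starts in the first column.

```c
#include <assert.h>

/* Two physical lines, one logical line: the splice sits inside the
   directive name, and is gone before the directive is recognised. */
#def\
ine ANSWER 42

/* The multiline macro from the question: after splice removal the
   whole definition is a single logical line. */
#define TEST(in) \
    int a = (in); \
    int b = 6;

int splice_demo(void) {
    TEST(5)                  /* expands to: int a = (5); int b = 6; */
    return a + b + ANSWER;   /* 5 + 6 + 42 */
}

/* Optional self-check, callable from any main(). */
void check_splice_demo(void) {
    assert(splice_demo() == 53);
}
```

Calling check_splice_demo() from main() should pass silently. Whatever line layout gcc -E happens to print for this file is a separate matter and is not covered by the sketch.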
What follows is a highly condensed and commented description of the phases. It is incomplete and somewhat imprecise in its details, in the hope that it's easier to digest than the language in the standard. But do read it in the original.
1. Replace trigraphs (which are now deprecated, so don't worry if you don't know what they are) and, if necessary, convert the program text to whatever character encoding the compiler requires.
2. Remove splices. All backslash-newline sequences are simply removed from the program text, leaving nothing behind. (OK, that's the theory. In practice, most compilers still know the original source line number of every bit of text. But this information is only used for producing diagnostics.)
3. Split the text into tokens and whitespace sequences, and replace each comment with a single space character.
4. "Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed". This is as close as the standard gets to defining the preprocessor, so it's reasonable to say that the "preprocessor" is the execution of phase 4. #include directives are preprocessing directives, and processing an #include directive starts with passing the included file through phases 1-3 before inserting it into the token stream to be further preprocessed.
5. Replace all the escape sequences in character and string literals with the actual characters (possibly wide characters) which will be used during execution.
6. Concatenate adjacent string literals.
7. Remove all whitespace, leaving only tokens. Convert preprocessing tokens into syntactic tokens. Parse the resulting token stream and convert it into a "translation unit". Or, in other words, compile the program into an object file (although that's way more specific than the language in the standard).
8. Combine all the translation units and necessary library modules into a single executable image. Informally, this is the linking phase and the result is something you can hand to the operating system for execution.
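The ordering of phases 2 through 6 can be observed from inside a program. The following is only a sketch; every identifier in it (counter, spliced, separated, WORLD, joined, check_phases) is invented for the illustration. Note that each continuation line starts in the first column, since any leading whitespace would otherwise end up inside the token or the string.

```c
#include <assert.h>
#include <string.h>

/* Phase 2 runs before tokenization (phase 3), so a splice can sit
   in the middle of an identifier ... */
static int cou\
nter = 3;

/* ... or in the middle of a string literal. */
static const char *spliced = "hel\
lo";

/* Phase 3: a comment is replaced by one space, so it separates the
   tokens "int" and "separated" instead of gluing them together. */
static int/* comment */separated = 7;

/* Phase 4 expands WORLD, phase 5 turns the \n escape into a single
   character, and phase 6 concatenates the adjacent string literals. */
#define WORLD "world"
static const char *joined = "hello, " WORLD "\n";

/* Optional self-check, callable from any main(). */
void check_phases(void) {
    assert(counter == 3);
    assert(strcmp(spliced, "hello") == 0);
    assert(separated == 7);
    assert(strcmp(joined, "hello, world\n") == 0);
    assert(strlen(joined) == 13);
}
```

None of this depends on what gcc -E prints; the assertions only check the result the phases are required to produce.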
That's what the standard mandates. But real-world compilers do lots of other stuff, like generate more or less readable error messages; rearrange the code in ways that might make it execute faster and/or occupy less space; insert debugging information into the executable; and produce whatever additional analyses and reports the user has requested (none of which are standardised). This, for example, includes the -E and/or -S outputs. The compiler does these things as a favour to you, and they can be helpful in understanding the way your program was compiled. But you shouldn't take them too seriously, since the official result of the compilation process is the actual executable.
Most compilation toolchains can also produce libraries, so it is not the case that all programs are immediately fully processed into executable images. But that's the only outcome which is standardised. Although the standard refers to libraries, particularly the standard library, it does not make any assumptions about how libraries come into existence.
The standard libraries (and headers) don't even have to exist in the filesystem; it's enough that the compiler recognises their names and responds appropriately. Some of the stuff the standard library has to implement cannot be written in portable C, so it is quite possible that the standard library source code, if it exists, is not all in the form of a standard C program. Standard library headers might include constructs which receive special handling by the compiler, and thus cannot be used by other compilers or copied directly into your program.
This might all seem too much in the air, but the intention was to make it possible to have C implementations which run on extremely limited processors, including processors without any external storage at all. (And it is still quite common to target embedded systems which might be missing lots of things you normally take for granted.) And, on the whole, it's served us pretty well over the years.