I was playing around a bit with the C preprocessor, when something which seemed so simple failed:
#define STR_START "
#define STR_END "
int puts(const char *);
int main() {
puts(STR_START hello world STR_END);
}
When I compile it with gcc (note: similar errors with clang), it fails, with these errors:
$ gcc test.c test.c:1:19: warning: missing terminating " character test.c:2:17: warning: missing terminating " character test.c: In function ‘main’: test.c:7: error: missing terminating " character test.c:7: error: ‘hello’ undeclared (first use in this function) test.c:7: error: (Each undeclared identifier is reported only once test.c:7: error: for each function it appears in.) test.c:7: error: expected ‘)’ before ‘world’ test.c:7: error: missing terminating " character
Which sort of confused me, so I ran it through the pre-processor:
$ gcc -E test.c # 1 "test.c" # 1 "" # 1 "" # 1 "test.c" test.c:1:19: warning: missing terminating " character test.c:2:17: warning: missing terminating " character int puts(const char *); int main() { puts(" hello world "); }
Which, despite the warnings, produces completely valid code (in the bolded text)!
If, macros in C are simply a textual replace, why is it that my initial example would fail? Is this a compiler bug? If not, where in the standards does it have information pertaining to this scenario?
Note: I am not looking for how to make my initial snippet compile. I am simply looking for info on why this scenario fails.
The problem is that even though the code expands to " hello, world "
, it's not being recognized as a single string literal token by the preprocessor; instead, it's being recognized as the (invalid) sequence of tokens "
, hello
, ,
, world
, "
.
N1570:
6.4 Lexical elements
...
3 A token is the minimal lexical element of the language in translation phases 7 and 8. The categories of tokens are: keywords, identifiers, constants, string literals, and punctuators. A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing tokens are: header names, identifiers, preprocessing numbers, character constants, string literals, punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories.69)If a'
or a"
character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments (described later), or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in 6.10, in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation. White space may appear within a preprocessing token only as part of a header name or between the quotation characters in a character constant or string literal.
69) An additional category, placemarkers, is used internally in translation phase 4 (see 6.10.3.3); it cannot occur in source files.
Note that neither '
nor "
are punctuators under this definition.
The preprocessor runs in multiple phases. Phase 3, tokenization, occurs before expansion, so preprocessor macros must represent full tokens. In your case, STR_START
and STR_END
are tokenized and then substituted, which makes those tokens invalid.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With