
Why can't we use the preprocessor to create custom-delimited strings?

I was playing around a bit with the C preprocessor, when something which seemed so simple failed:

#define STR_START "
#define STR_END "

int puts(const char *);

int main() {
    puts(STR_START hello world STR_END);
}

When I compile it with gcc (clang gives similar errors), it fails with these errors:

$ gcc test.c
test.c:1:19: warning: missing terminating " character
test.c:2:17: warning: missing terminating " character
test.c: In function ‘main’:
test.c:7: error: missing terminating " character
test.c:7: error: ‘hello’ undeclared (first use in this function)
test.c:7: error: (Each undeclared identifier is reported only once
test.c:7: error: for each function it appears in.)
test.c:7: error: expected ‘)’ before ‘world’
test.c:7: error: missing terminating " character

Which sort of confused me, so I ran it through the pre-processor:

$ gcc -E test.c
# 1 "test.c"
# 1 ""
# 1 ""
# 1 "test.c"
test.c:1:19: warning: missing terminating " character
test.c:2:17: warning: missing terminating " character

int puts(const char *);

int main() {
    puts(" hello world ");
}

Which, despite the warnings, produces completely valid code (the puts(" hello world "); line)!

If macros in C are simply textual replacement, why does my initial example fail? Is this a compiler bug? If not, where does the standard cover this scenario?

Note: I am not looking for how to make my initial snippet compile. I am simply looking for info on why this scenario fails.

Richard J. Ross III asked May 28 '13 18:05


2 Answers

The problem is that even though the code expands to " hello world ", the preprocessor does not recognize it as a single string literal token; instead, it sees the (invalid) sequence of preprocessing tokens ", hello, world, ".

From N1570 (the final C11 draft):

6.4 Lexical elements
...
3 A token is the minimal lexical element of the language in translation phases 7 and 8. The categories of tokens are: keywords, identifiers, constants, string literals, and punctuators. A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing tokens are: header names, identifiers, preprocessing numbers, character constants, string literals, punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories. 69) If a ' or a " character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments (described later), or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in 6.10, in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation. White space may appear within a preprocessing token only as part of a header name or between the quotation characters in a character constant or string literal.
69) An additional category, placemarkers, is used internally in translation phase 4 (see 6.10.3.3); it cannot occur in source files.

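Note that neither ' nor " is a punctuator under this definition. A lone " that cannot be lexed as part of a complete string literal therefore falls into the last category (single non-white-space characters that match nothing else), and for ' and " in that category the behavior is undefined.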
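For contrast, here is a minimal sketch of a string literal that the preprocessor can build, because it is created as one whole token. The macro STR and the function quoted are hypothetical names chosen for illustration; they do not appear in the question. The # operator (6.10.3.2) converts the preprocessing tokens of its argument into a single string literal token during phase 4, so no stray " ever has to be lexed on its own.

```c
/* Hypothetical helper, not part of the question's code.
   The # operator stringizes its argument: the tokens `hello` and `world`
   become ONE string-literal preprocessing token, with the white space
   between them reduced to a single space. */
#define STR(x) #x

/* Returns the literal produced by the # operator. */
const char *quoted(void) {
    return STR(hello world);
}
```

Passing quoted() to puts should print hello world, since leading and trailing white space in the argument is dropped when it is stringized.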

John Bode answered Nov 15 '22 23:11


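Translation runs in multiple phases. Phase 3, which splits the source into preprocessing tokens, occurs before macro expansion in phase 4, so a macro's replacement list must consist of complete tokens. In your case, the lone " in each of STR_START and STR_END is tokenized on its own before any substitution happens, and the expanded text is not re-tokenized into a string literal afterwards, so those tokens stay invalid.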
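A small sketch of the same ordering seen from the other side, assuming a C99 or later compiler. The macros NAME and GREETING and both function names are hypothetical and only serve the illustration. Because string literals are already complete tokens after phase 3, a macro name that appears inside one is never expanded, while a macro whose replacement list is a complete string literal works as expected.

```c
/* Hypothetical macros for illustration only. */
#define NAME world
#define GREETING "hello "  /* replacement list is one complete string literal */

/* "hello NAME" is a single string-literal token after phase 3, so phase 4
   never sees NAME as an identifier and does not expand it. */
const char *not_expanded(void) {
    return "hello NAME";
}

/* GREETING expands to a complete literal in phase 4; adjacent string
   literals are then concatenated in phase 6. */
const char *concatenated(void) {
    return GREETING "world";
}
```

The first function should return the text hello NAME unchanged, and the second should return hello world.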

Mike Pelley answered Nov 15 '22 23:11