Why can't we use the preprocessor to create custom-delimited strings?

Question

I was playing around a bit with the C preprocessor, when something which seemed so simple failed:

#define STR_START "
#define STR_END "

int puts(const char *);

int main() {
    puts(STR_START hello world STR_END);
}

When I compile it with gcc (note: similar errors with clang), it fails, with these errors:

$ gcc test.c
test.c:1:19: warning: missing terminating " character
test.c:2:17: warning: missing terminating " character
test.c: In function ‘main’:
test.c:7: error: missing terminating " character
test.c:7: error: ‘hello’ undeclared (first use in this function)
test.c:7: error: (Each undeclared identifier is reported only once
test.c:7: error: for each function it appears in.)
test.c:7: error: expected ‘)’ before ‘world’
test.c:7: error: missing terminating " character

Which sort of confused me, so I ran it through the pre-processor:

$ gcc -E test.c
# 1 "test.c"
# 1 ""
# 1 ""
# 1 "test.c"
test.c:1:19: warning: missing terminating " character
test.c:2:17: warning: missing terminating " character

int puts(const char *);

int main() {
    puts(" hello world ");
}

Which, despite the warnings, produces completely valid code (the puts(" hello world "); line)!

If macros in C are simply textual replacement, why does my initial example fail? Is this a compiler bug? If not, where does the standard cover this scenario?

Note: I am not looking for how to make my initial snippet compile. I am simply looking for info on why this scenario fails.

John Bode · Accepted Answer

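The problem is that even though the code expands to " hello world ", it is never recognized as a single string literal token by the preprocessor; instead, it is recognized as the (invalid) sequence of four preprocessing tokens: a lone ", then hello, then world, then another lone ". The -E output only looks valid because it prints those tokens back out as text.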

N1570:

6.4 Lexical elements
...
3 A token is the minimal lexical element of the language in translation phases 7 and 8. The categories of tokens are: keywords, identifiers, constants, string literals, and punctuators. A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing tokens are: header names, identifiers, preprocessing numbers, character constants, string literals, punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories.⁶⁹⁾ If a ' or a " character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments (described later), or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in 6.10, in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation. White space may appear within a preprocessing token only as part of a header name or between the quotation characters in a character constant or string literal.

69) An additional category, placemarkers, is used internally in translation phase 4 (see 6.10.3.3); it cannot occur in source files.

Note that neither ' nor " is a punctuator under this definition, so a lone " falls into the last category, where the behavior is undefined.

Mike Pelley · Answer

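Translation runs in multiple phases. Phase 3, tokenization, occurs before macro expansion in phase 4, so a macro's replacement list must consist of complete tokens. In your case, the " in STR_START and the " in STR_END are each tokenized on their own, as invalid tokens, before any substitution happens; substituting them later does not re-tokenize the line, so they never pair up into a string literal.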

Tags: c, c-preprocessor, gcc, macros, clang

Richard J. Ross III
