
C++: Is there a standard definition for end-of-line in a multi-line string constant?

Tags:

c++

c++11

portability

If I have a multi-line C++11 string constant such as

    R"(line 1
    line 2
    line3)"

Is it defined what character(s) the line terminator/separator consists of?

228

asked Oct 05 '16 23:10

Mark Harrison

1 Answer

The intent is that a newline in a raw string literal maps to a single '\n' character. This intent is not expressed as clearly as it should be, which has led to some confusion.

Citations are to the 2011 ISO C++ standard.

First, here's the evidence that it maps to a single '\n' character.

A note in section 2.14.5 [lex.string] paragraph 4 says:

[ Note: A source-file new-line in a raw string literal results in a new-line in the resulting execution string-literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:

    const char *p = R"(a\
    b
    c)";
    assert(std::strcmp(p, "a\\\nb\nc") == 0);

— end note ]

This clearly states that a newline is mapped to a single '\n' character. It also matches the observed behavior of g++ 6.2.0 and clang++ 3.8.1 (tests done on a Linux system using source files with Unix-style and Windows-style line endings).

Given the clearly stated intent in the note and the behavior of two popular compilers, I'd say it's safe to rely on this -- though it would be interesting to see how other compilers actually handle this.

However, a literal reading of the normative wording of the standard could easily lead to a different conclusion, or at least to some uncertainty.

Section 2.5 [lex.pptoken] paragraph 3 says (emphasis added):

Between the initial and final double quote characters of the raw string, *any* transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted; this reversion shall apply before any d-char, r-char, or delimiting parenthesis is identified.

The phases of translation are specified in 2.2 [lex.phases]. In phase 1:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary.

If we assume that the mapping of physical source file characters to the basic source character set and the introduction of new-line characters are "transformations", we might reasonably conclude that, for example, a newline in the middle of a raw string literal in a Windows-format source file should be equivalent to a \r\n sequence. (I can imagine that being useful for Windows-specific code.)
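If you actually want \r\n in the resulting string, one option that doesn't depend on how the compiler reads the source file is to convert at run time. This is a minimal sketch, assuming a hypothetical helper I've called `to_crlf`; it is not part of the standard library.

```cpp
#include <string>

// Hypothetical helper: returns a copy of `text` in which every '\n' that is
// not already preceded by '\r' is replaced by the two-character sequence
// "\r\n". Existing "\r\n" pairs are left unchanged.
inline std::string to_crlf(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    char prev = '\0';
    for (char c : text) {
        if (c == '\n' && prev != '\r') {
            out += '\r';
        }
        out += c;
        prev = c;
    }
    return out;
}
```

Passing a raw string literal through `to_crlf` gives the same result whether the compiler mapped each source line break to '\n' or preserved a "\r\n" pair.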

(This interpretation does lead to problems with systems where the end-of-line indicator is not a sequence of characters, for example where each line is a fixed-width record. Such systems are rare these days.)

As "Cheers and hth. - Alf"'s answer points out, there is an open Defect Report for this issue. It was submitted in 2013 and has not yet been resolved.

Personally, I think the root of the confusion is the word "any" (emphasis added as before):

Between the initial and final double quote characters of the raw string, *any* transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted; this reversion shall apply before any d-char, r-char, or delimiting parenthesis is identified.

Surely the mapping of physical source file characters to the basic source character set can reasonably be thought of as a transformation. The parenthesized clause "(trigraphs, universal-character-names, and line splicing)" seems to be intended to specify which transformations are to be reverted, but that either attempts to change the meaning of the word "transformations" (which the standard does not formally define) or contradicts the use of the word "any".

I suggest that changing the word "any" to "certain" would express the apparent intent much more clearly:

Between the initial and final double quote characters of the raw string, certain transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted; this reversion shall apply before any d-char, r-char, or delimiting parenthesis is identified.

This wording would make it much clearer that "trigraphs, universal-character-names, and line splicing" are the only transformations that are to be reverted. (Not everything done in translation phases 1 and 2 is reverted, just those specific listed transformations.)

189

answered Oct 14 '22 05:10

Keith Thompson
