Do useless backslashes produce well-defined string constants?

Both C and C++ support a seemingly equivalent set of escape sequences such as \b, \t, \n, \" and others starting with the backslash character (\). How is a backslash handled if an ordinary character follows? As far as I remember from several compilers, the escape character \ is silently skipped. On cppreference.com, I read these articles:

  • Escape sequences (C)
  • Escape sequences (C++)

I only found this note (in the C article) about orphan backslashes

ISO C requires a diagnostic if the backslash is followed by any character not listed here: [...]

above the reference table. I also had a look at some online compilers:

C demo

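#include <stdio.h>
#include <string.h>  /* strcmp */

int main(void) {
    /* Each call prints 1 if the two literals compare equal, 0 otherwise. */
    printf("%d", !strcmp("\\ x", "\\ x"));
    printf("%d", !strcmp("\\ x", "\\\ x"));   /* third backslash is the orphan */
    printf("%d", !strcmp("\\ x", "\\\\ x"));
    return 0;
}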

C++ demo

#include <iostream>
#include <string>
using namespace std;

int main() {
    cout << (string("\\ x") == "\\ x");
    cout << (string("\\ x") == "\\\ x");
    cout << (string("\\ x") == "\\\\ x");
    return 0;
}

Both treat "\\ x" and "\\\ x" as equivalent, with a (kind of) warning via syntax highlighting. In other words, "\\\ x" has been transformed into "\\ x".

Can I assume this to be defined behavior?

Clarification (edit)

  • I'm not asking about obviously invalid string literals like "\".
  • I'm aware that an orphan backslash is somewhat problematic.
  • I want to know if the result, the constant built by the compiler, is defined.

Edit #2: Focus even more on the constant being generated (and portability).

Wolf asked Mar 11 '20 10:03



2 Answers

The answer is no. In C the program is invalid; in C++ such an escape sequence is conditionally-supported with implementation-defined semantics.

C Standard

says it is syntactically wrong (emphasis mine): it does not produce a valid token, thus the program is invalid:

5.2.1 Character sets

2/ In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters.

6.4.4.4 Character constants

3/ The single-quote ', the double-quote ", the question-mark ?, the backslash \, and arbitrary integer values are representable according to the following table of escape sequences:

  • single quote ' \'
  • double quote " \"
  • question mark ? \?
  • backslash \ \\
  • octal character \octal digits
  • hexadecimal character \xhexadecimal digits

8/ In addition, characters not in the basic character set are representable by universal character names and certain nongraphic characters are representable by escape sequences consisting of the backslash \ followed by a lowercase letter: \a, \b, \f, \n, \r, \t, and \v. Note: If any other character follows a backslash, the result is not a token and a diagnostic is required.

C++ standard

says differently (emphasis mine):

5.13.3 Character literals

7/ Certain non-graphic characters, the single quote ', the double quote ", the question mark ?, and the backslash \, can be represented according to Table 8. The double quote " and the question mark ?, can be represented as themselves or by the escape sequences \" and \? respectively, but the single quote ' and the backslash \ shall be represented by the escape sequences \' and \\ respectively. Escape sequences in which the character following the backslash is not listed in Table 8 are conditionally-supported, with implementation-defined semantics. An escape sequence specifies a single character.

Thus for C++, you need to check your compiler manual for the semantics, but the program is syntactically valid.

Jean-Baptiste Yunès answered Oct 11 '22 06:10



You need to compile with a conforming C compiler. Various online compilers tend to use gcc which is by default set to "lax non-standard mode", aka GNU C. This may or may not enable some non-standard escape sequences, but it also won't produce compiler errors even when you violate the C language - you might get away with a "warning", but that doesn't make the code valid C.

If you tell gcc to behave as a conforming C compiler with -std=c17 -pedantic-errors, you get this error:

error: unknown escape sequence: '\040'

040 is octal for 32, which is the ASCII code for ' '. (For some reason gcc uses octal notation for escape sequences internally; it might be because \0 is octal, but I don't know why.)

Lundin answered Oct 11 '22 06:10
