Is it bad to have accented characters in c++ source code?

I want my program to be as portable as possible. I search a string for accented characters, e.g. è. Could this be a problem? Is there a C++ equivalent of HTML entities?

It would be used in a switch statement, for example:

switch(someChar) //someChar is of type char
{
   case 'é' :
        x = 1;
        break;
   case 'è' :
   ...
}
asked Aug 16 '12 by Celeritas


1 Answer

The main issue with using non-ASCII characters in C++ source is that the compiler must be aware of the encoding used for the source. If the source is 7-bit ASCII then it doesn't usually matter, since almost all compilers assume an ASCII-compatible encoding by default.

Also, not all compilers are configurable as to the encoding, so two compilers might unconditionally use incompatible encodings; in that case using non-ASCII characters can produce source code that can't be compiled by both.

  • GCC: has command-line options for setting the source, execution, and wide execution encodings (-finput-charset, -fexec-charset, and -fwide-exec-charset). The defaults are set by the locale, which usually uses UTF-8 these days. (A sketch of these switches follows this list.)
  • MSVC: uses a so-called 'BOM' (byte order mark) to determine the source encoding (choosing between UTF-16BE/LE, UTF-8, and the system locale encoding), and always uses the system locale as the execution encoding. edit: As of VS 2015 Update 2, MSVC supports compiler switches (/source-charset, /execution-charset, and the shorthand /utf-8) to control the source and execution charsets, including support for UTF-8.
  • Clang: always uses UTF-8 as both the source and execution encodings.
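
Here's a minimal sketch of passing those switches, assuming a Latin-1 encoded source file (the file name encodings.cpp is just for illustration; flag spellings are as documented for GCC and recent MSVC, so adjust for your toolchain):

// encodings.cpp -- state the source/execution charsets explicitly
// instead of relying on compiler defaults:
//
//   g++ -finput-charset=ISO-8859-1 -fexec-charset=UTF-8 encodings.cpp
//   cl /source-charset:iso-8859-1 /execution-charset:utf-8 encodings.cpp
//
#include <cstdio>

int main()
{
    // The literal is converted from the source charset to the
    // execution charset at compile time.
    std::printf("%s\n", "é");
}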

So consider what happens with your code to search for an accented character if the string being searched is UTF-8 (perhaps because the execution character set is UTF-8). Whether or not the character literal 'é' behaves as you expect, you will not find accented characters that way, because in UTF-8 no accented character is represented by a single byte. Instead you'd have to search for multi-byte sequences.
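
For example, here's a minimal sketch of such a search, assuming the haystack is UTF-8 (the bytes 0xC3 0xA9 are the UTF-8 encoding of U+00E9, 'é'):

#include <iostream>
#include <string>

int main()
{
    std::string s = "caf\xC3\xA9";   // "café" in UTF-8, spelled out byte by byte

    // 'é' occupies two bytes in UTF-8, so search for the byte
    // sequence rather than for a single char.
    if (s.find("\xC3\xA9") != std::string::npos)
        std::cout << "found e-acute\n";
}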


There are different kinds of escapes which C++ allows in character and string literals. Universal character names (UCNs) let you designate a Unicode code point, and are handled exactly as if that character appeared in the source; for example \u00E9 or \U000000E9.

(Some other languages have \u for code points up to U+FFFF, but either lack support for code points beyond that or make you spell them as surrogate pairs. You cannot use surrogate code points in C++; instead C++ provides the \U variant, which covers all code points directly.)
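
For instance, both of the following spell U+00E9 and produce the same bytes as writing 'é' directly in a correctly encoded source file:

#include <iostream>

int main()
{
    // Each escape names the code point U+00E9; which bytes end up in
    // the string depends on the execution charset.
    std::cout << "caf\u00E9\n";
    std::cout << "caf\U000000E9\n";
}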

UCNs are also supposed to work outside of character and string literals; outside such literals, UCNs are restricted to characters not in the basic source character set. For a long time compilers didn't implement this (C++98) feature, however. Now Clang appears to have pretty complete support, MSVC seems to have at least partial support, and GCC purports to provide experimental support with the option -fextended-identifiers.

Recall that UCNs are supposed to be treated identically to the actual character appearing in the source; thus compilers with good UCN identifier support also allow you to simply write such identifiers using the actual character, so long as the compiler's source encoding supports that character in the first place.
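
Here's a sketch of that equivalence, assuming a compiler with UCN identifier support and a UTF-8 encoded source file:

#include <iostream>

// Both spellings name the same variable: the UCN is treated
// identically to the character itself.
int caf\u00E9 = 1;

int main()
{
    std::cout << café << '\n';   // prints 1
}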

C++ also supports hex escapes. These are \x followed by any number of hexadecimal digits. A hex escape will represent a single integral value, as though it were a single codepoint with that value, and no conversion to the execution charset is done on the value. If you need to represent a specific byte (or char16_t, or char32_t, or wchar_t) value independent of encodings, then this is what you want.

There are also octal escapes but they aren't as commonly useful as UCNs or hex escapes.
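
For example, here's a small sketch contrasting the two; each escape below produces the single byte 0xE9 (233 decimal) with no charset conversion:

#include <cstdio>

int main()
{
    char hex = '\xE9';   // hex escape: exactly the byte 0xE9
    char oct = '\351';   // octal escape: 351 octal == E9 hex == 233

    std::printf("%d %d\n", (unsigned char)hex, (unsigned char)oct);   // 233 233
}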


Here's the diagnostic that Clang shows when you use 'é' in a source file encoded as ISO-8859-1 or cp1252:

warning: illegal character encoding in character literal [-Winvalid-source-encoding]
    std::printf("%c\n",'<E9>');
                       ^

Clang issues this only as a warning and will just directly output a char object with the source byte's value. This is done for backwards compatibility with non-UTF-8 source code.

If you use UTF-8 encoded source then you get this:

error: character too large for enclosing character literal type
    std::printf("%c\n",'<U+00E9>');
                       ^

Clang detects that the UTF-8 encoding corresponds to the Unicode code point U+00E9, and that this code point is outside the range a single char can hold, and so reports an error. (Clang escapes the non-ASCII character as well, because it determined that the console it was run under couldn't handle printing the non-ASCII character.)

answered Oct 16 '22 by bames53