I want my program to be as portable as possible. I search a string for accented characters, e.g. è. Could this be a problem? Is there a C++ equivalent of HTML entities?
It would be used in a switch statement, for example:
switch(someChar) //someChar is of type char
{
case 'é' :
x = 1;
break;
case 'è' :
...
}
The main issue with using non-ASCII characters in C++ source is that the compiler must know the encoding used for the source. If the source is 7-bit ASCII then it usually doesn't matter, since almost all compilers assume an ASCII-compatible encoding by default.
Also, not all compilers are configurable as to the encoding, so two compilers might unconditionally use incompatible encodings, meaning that using non-ASCII characters can result in source code that can't be used with both.
So consider what happens with your code to search for an accented character if the string being searched is UTF-8 (perhaps because the execution character set is UTF-8). Whether or not the character literal 'é' works as you expect, you will not be finding accented characters, because in UTF-8 an accented character is never represented by a single byte. Instead you'd have to search for multi-byte sequences.
There are different kinds of escapes that C++ allows in character and string literals. Universal character names (UCNs) let you designate a Unicode code point, and are handled exactly as if that character appeared in the source. For example, \u00E9 or \U000000E9.
(Some other languages have \u to support code points up to U+FFFF, but either lack C++'s support for code points beyond that or make you use surrogate code points. You cannot use surrogate code points in C++; instead C++ has the \U variant to support all code points directly.)
UCNs are also supposed to work outside of character and string literals, where they are restricted to characters not in the basic source character set. Until recently, however, compilers didn't implement this (C++98) feature. Now Clang appears to have fairly complete support, MSVC seems to have at least partial support, and GCC purports to provide experimental support with the option -fextended-identifiers.
Recall that UCNs are supposed to be treated identically to the actual character appearing in the source; thus compilers with good UCN identifier support also allow you to simply write identifiers using the actual character, so long as the compiler's source encoding supports that character in the first place.
C++ also supports hex escapes. These are \x followed by any number of hexadecimal digits. A hex escape represents a single integral value, as though it were a single code point with that value, and no conversion to the execution charset is done on the value. If you need to represent a specific byte (or char16_t, or char32_t, or wchar_t) value independent of encodings, then this is what you want.
There are also octal escapes, but they aren't as commonly useful as UCNs or hex escapes.
Here's the diagnostic that Clang shows when you use 'é' in a source file encoded with ISO-8859-1 or cp1252:
warning: illegal character encoding in character literal [-Winvalid-source-encoding]
std::printf("%c\n",'<E9>');
^
Clang issues this only as a warning and simply emits a char object with the source byte's value. This is done for backwards compatibility with non-UTF-8 source code.
If you use UTF-8 encoded source then you get this:
error: character too large for enclosing character literal type
std::printf("%c\n",'<U+00E9>');
^
Clang detects that the UTF-8 encoding corresponds to the Unicode code point U+00E9, and that this code point is outside the range a single char can hold, and so reports an error. (Clang escapes the non-ASCII character as well, because it determined that the console it was run under couldn't handle printing the non-ASCII character.)