If I write the statement below in C++ under Visual Studio, what will the encoding of the string literal be?
const char *c = "£";
Under the Visual Studio project settings I have set the "Charset" to "Not set".
For characters in the 7-bit ASCII range, the UTF-8 representation is byte-for-byte identical to ASCII, so existing ASCII text is already valid UTF-8. All other Unicode code points are encoded in UTF-8 as sequences of two to four bytes; most Western European characters need only two. UTF-16, by contrast, encodes each code point as either two or four bytes; the numbers in the names refer to the size of the code unit (8 bits versus 16 bits), not to the number of bytes per character. ISO 8859-1 is a single-byte encoding that can represent only the first 256 Unicode code points, and it too encodes ASCII exactly the same way.
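To make the variable width concrete, here is a minimal sketch (my own illustration, not part of the question) that spells out each character's UTF-8 bytes explicitly, so the stored data does not depend on any compiler charset setting; it just prints the byte counts:

#include <cstdio>
#include <cstring>

int main() {
    // Each string is written with explicit UTF-8 bytes so the result is the
    // same regardless of the compiler's source or execution charset.
    const char *ascii = "A";                 // U+0041 -> 1 byte (same as ASCII)
    const char *pound = "\xC2\xA3";          // U+00A3 '£' -> 2 bytes
    const char *euro  = "\xE2\x82\xAC";      // U+20AC '€' -> 3 bytes
    const char *emoji = "\xF0\x9F\x98\x80";  // U+1F600 -> 4 bytes
    std::printf("%zu %zu %zu %zu\n", std::strlen(ascii), std::strlen(pound),
                std::strlen(euro), std::strlen(emoji));  // prints: 1 2 3 4
    return 0;
}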
Setting the Character Set to 'Not Set' simply means that neither of the preprocessor macros _UNICODE nor _MBCS will be defined. It has no effect on the character sets the compiler itself uses to read your source or to encode your literals.
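As a rough sketch of what the setting does control (assuming a Windows build where <tchar.h> is available): the three values of the Character Set property only decide which of these macros get predefined, and <tchar.h> maps its generic-text types accordingly:

#include <tchar.h>

// "Use Unicode Character Set"    -> _UNICODE (and UNICODE) defined: _TCHAR is wchar_t, _T("£") expands to L"£"
// "Use Multi-Byte Character Set" -> _MBCS defined:                  _TCHAR is char
// "Not Set"                      -> neither macro defined:          _TCHAR is char
const _TCHAR *t = _T("£");

// The plain narrow literal from the question compiles the same way under all
// three settings; its bytes are decided by the encodings described below.
const char *c = "£";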
The two settings that determine how the bytes of your source are converted to a string literal in the program are the 'source character set' and the 'execution character set'. The compiler will convert string literals from the source encoding to the execution encoding.
The source encoding is the encoding used by the compiler to interpret the source file's bytes. It applies not just to string and character literals, but also to everything else in source including, for example, identifiers.
If Visual Studio's compiler detects a Unicode 'signature' (a byte order mark) in a source file then it will use the corresponding Unicode encoding as the source encoding. Otherwise it will use the system's codepage as the source encoding.
The execution encoding is the encoding in which the compiler stores string and character literals, so the string and character data produced by literals in the compiled program is encoded using the execution encoding.
Visual Studio's compiler uses the system's codepage as the execution encoding by default. (Recent versions of the compiler let you override both encodings explicitly with the /source-charset, /execution-charset, or /utf-8 options.)
When Visual Studio converts string and character literal data from the source encoding to the execution encoding, it replaces characters that cannot be represented in the execution encoding with '?'.
So for your example:
const char *c = "£";
Assuming that your source is saved using Microsoft's "UTF-8 with signature" format and your system uses CP1252, as most systems in Western locales do, the string literal will be converted to:
0xA3 0x00
On the other hand, if the execution charset is something that doesn't include '£', such as CP1251 (Cyrillic, used in Windows' Russian locale), then the string literal will end up as:
0x3F 0x00
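If you are not sure which case applies on your machine, one way to check (a minimal sketch, assuming a hosted environment with a console) is to print the bytes the compiler actually stored for the literal:

#include <cstdio>

int main() {
    const char *c = "£";
    // Dump each byte of the literal exactly as the compiler stored it.
    for (const unsigned char *p = reinterpret_cast<const unsigned char *>(c); *p != 0; ++p)
        std::printf("0x%02X ", static_cast<unsigned>(*p));
    std::printf("\n");
    // Typical results: 0xA3 (CP1252), 0x3F (an execution charset without '£'),
    // or 0xC2 0xA3 (a UTF-8 execution charset).
    return 0;
}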
If you want to avoid depending on the source code encoding you can use Universal Character Names (UCNs):
const char *c = "\u00A3"; // "£"
If you want to guarantee a UTF-8 representation you'll also need to avoid dependence on the execution encoding. You can do that by manually encoding it:
const char *c = "\xC2\xA3"; // UTF-8 encoding of "£"
C++11 introduces UTF-8 string literals, which are a better option once your compiler supports them:
const char *c = u8"£";
or
const char *c = u8"\u00A3"; // "£"