The C++ standard mentions multiple different character sets. In particular, it mentions the following character sets: <ol> <li>In 2.2 [lex.phases] bullet 1 physical source file characters and their mapping to the basic source character set is mentioned.</li> <li>In 2.2 [lex.phases] bullet 2 execution character set is mentioned.</li> <li>In 2.3 [lex.charset] paragraph 3 a basic execution character set and a basic execution wide-character set are mentioned.</li> <li>The same section 2.3 [lex.charset] 3 also mentions an execution character set and an execution wide-character set.</li> <li>When reading or writing files these use some other character set.</li> </ol> What are all those different character set used for, how are conversions between them done, and which of these values are locale dependent? In particular, how are string literals represented?

Here is a break down of the different character sets used by the compiler itself (all reference to the standard are for C++14, actually): <ol> <li>The physical source file characters are those used in the C++ source. Most likely these are now encoded using some Unicode encoding, e.g., UTF-8 or UTF-16. If you are from a European or an American background you may be using ASCII whose characters are conveniently encoded identically in UTF-8 (every ASCII file is a UTF-8 file but not the other way around). The physical source file characters_ may also be something unusual like EBCDIC.</li> <li> The basic source character set is what the compiler, at least conceptually, consumes. It is produced from the physical source file characters and either mapping them to their respective basic character or to a sequence of basic characters representing the physical source character using a universal character name (see 2.2 [lex.phases] paragraph 1). The basic source character set is a just a set of 96 character (2.3 [lex.charset] paragraph 1): <blockquote> a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " ’ and the 5 special characters space (' '), horizontal tab (\t), vertical tab (\v), form feed (\f), and newline (\n) </blockquote> The mapping between the physical source character set and the basic character set is implementation defined. </li> <li> The basic execution character set and the basic execution wide-character set are characters set capable of representing the basic source character set expanded by a few special character: <blockquote> alert ('\a'), backspace ('\b'), carriage return ('\r'), and a null character ('\0') </blockquote> The difference between the non-wide and the wide version is whether the characters are represented using <code>char</code> or <code>wchar_t</code>. </li> <li>The execution character set and the execution wide-character set are implementation defined extensions of the basic character set and the basic wide-character set. In 2.3 [lex.charset] paragraph 3 it is stated that the additional members and the values of the additional members of execution character set are locale specific. It isn't clear which locale is referred to but I suspect the locale used during compilation is meant. In any case, the execution character sets are implementation defined (also according to 2.3 [lex.charset] paragraph 3).</li> <li>Character and string literals are originally represented using the basic source character set with some characters possibly using universal character names. All of these are converted at compile time into the execution character set. According to 2.14.3 [lex.ccon] character literals representable as one <code>char</code> in the execution character set just work. If multiple <code>char</code>s are needed the character literals may be conditionally supported (and they'd have type <code>int</code>). For string literals the conversion is described in 2.14.5 [lex.string]. Paragraph 9 states that UTF-8 string literals (e.g. <code>u8"hello"</code>) result in a sequence of values corresponding to the code units of the UTF-8 string. Otherwise the translation of characters and universal character names is the same as that for character literals (in particular, it is implementation defined) although characters resulting in multi-byte sequences for narrow string just result in multiple characters (this case isn't necessary support for character literals).</li> </ol> So far, only the result of compilation is considered. Any character which isn't part of a character or a string literal is used to specify what the code does. The interesting question is what happened to the literals? The literals are all basically translated into an implementation defined representation. That is implementation defined means that it is somewhere documented what is supposed to happen but it can differ between different implementations. How does that help when dealing with characters or strings coming from somewhere? Well, any character or string which is read is converted to the corresponding execution character set. In particular, when a file is read, all characters are transformed to this common representation. Of course, for this transformation to work, the locale used for reading a file needs to be setup according to the encoding of that file. If the locale isn't explicitly mentioned the global locale is used which is initially determined by the system is used. The initial global locale is probably set somehow based on user preferences, e.g., based no environment variables. If a a file is read which uses a different encoding than this global locale, a corresponding different locale matching the encoding of the file needs to be used. Correspondingly, when writing characters using one of the execution character sets, these are converted according to the encoding specified by the current locale. Again, it may be necessary to replace the locale if a specific encoding is needed. All this effectively means that internally to a program all string and character processing happens using the implementation defined execution character set. All characters being read by a program need to be converted to this character set and all characters written start as characters in this execution character set and need to be converted appropriately to the external encoding. If course, in an ideal set up the conversion between the execution character set and the external representation is the identity, e.g., because the execution character set uses UTF-8 and the external representation also uses UTF-8. Correspondingly for the execution wide-character set except in this case UTF-16 would be used (one of the two variations as UTF-16 can either use big endian or little endian representation).

What are the different character sets used for?

1 Answers

Here is a break down of the different character sets used by the compiler itself (all reference to the standard are for C++14, actually):

The physical source file characters are those used in the C++ source. Most likely these are now encoded using some Unicode encoding, e.g., UTF-8 or UTF-16. If you are from a European or an American background you may be using ASCII whose characters are conveniently encoded identically in UTF-8 (every ASCII file is a UTF-8 file but not the other way around). The physical source file characters_ may also be something unusual like EBCDIC.
The basic source character set is what the compiler, at least conceptually, consumes. It is produced from the physical source file characters and either mapping them to their respective basic character or to a sequence of basic characters representing the physical source character using a universal character name (see 2.2 [lex.phases] paragraph 1). The basic source character set is a just a set of 96 character (2.3 [lex.charset] paragraph 1):

a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " ’

and the 5 special characters space (' '), horizontal tab (\t), vertical tab (\v), form feed (\f), and newline (\n)

The mapping between the physical source character set and the basic character set is implementation defined.
The basic execution character set and the basic execution wide-character set are characters set capable of representing the basic source character set expanded by a few special character:

alert ('\a'), backspace ('\b'), carriage return ('\r'), and a null character ('\0')

The difference between the non-wide and the wide version is whether the characters are represented using char or wchar_t.
The execution character set and the execution wide-character set are implementation defined extensions of the basic character set and the basic wide-character set. In 2.3 [lex.charset] paragraph 3 it is stated that the additional members and the values of the additional members of execution character set are locale specific. It isn't clear which locale is referred to but I suspect the locale used during compilation is meant. In any case, the execution character sets are implementation defined (also according to 2.3 [lex.charset] paragraph 3).
Character and string literals are originally represented using the basic source character set with some characters possibly using universal character names. All of these are converted at compile time into the execution character set. According to 2.14.3 [lex.ccon] character literals representable as one char in the execution character set just work. If multiple chars are needed the character literals may be conditionally supported (and they'd have type int). For string literals the conversion is described in 2.14.5 [lex.string]. Paragraph 9 states that UTF-8 string literals (e.g. u8"hello") result in a sequence of values corresponding to the code units of the UTF-8 string. Otherwise the translation of characters and universal character names is the same as that for character literals (in particular, it is implementation defined) although characters resulting in multi-byte sequences for narrow string just result in multiple characters (this case isn't necessary support for character literals).

So far, only the result of compilation is considered. Any character which isn't part of a character or a string literal is used to specify what the code does. The interesting question is what happened to the literals? The literals are all basically translated into an implementation defined representation. That is implementation defined means that it is somewhere documented what is supposed to happen but it can differ between different implementations.

How does that help when dealing with characters or strings coming from somewhere? Well, any character or string which is read is converted to the corresponding execution character set. In particular, when a file is read, all characters are transformed to this common representation. Of course, for this transformation to work, the locale used for reading a file needs to be setup according to the encoding of that file. If the locale isn't explicitly mentioned the global locale is used which is initially determined by the system is used. The initial global locale is probably set somehow based on user preferences, e.g., based no environment variables. If a a file is read which uses a different encoding than this global locale, a corresponding different locale matching the encoding of the file needs to be used.

Correspondingly, when writing characters using one of the execution character sets, these are converted according to the encoding specified by the current locale. Again, it may be necessary to replace the locale if a specific encoding is needed.

All this effectively means that internally to a program all string and character processing happens using the implementation defined execution character set. All characters being read by a program need to be converted to this character set and all characters written start as characters in this execution character set and need to be converted appropriately to the external encoding. If course, in an ideal set up the conversion between the execution character set and the external representation is the identity, e.g., because the execution character set uses UTF-8 and the external representation also uses UTF-8. Correspondingly for the execution wide-character set except in this case UTF-16 would be used (one of the two variations as UTF-16 can either use big endian or little endian representation).

answered Oct 28 '22 08:10

Dietmar Kühl

Related questions
                            
                                C++11 Generating random numbers from frequently changing range
                            
                                What precision are floating-point arithmetic operations done in?
                            
                                Friend operator in template struct raises redefinition error
                            
                                Sort when only equality is available
                            
                                Custom malloc implementation
                            
                                C++11: Does an assignment operator prevent a type from being POD, and thus being global-initialized?
                            
                                What is the difference between path::string() and path::generic_string() in boost?
                            
                                Is OpenCv already threaded?
                            
                                error C2823: a typedef template is illegal - function pointer
                            
                                Confused about array parameters
                            
                                Converting between WCHAR* to wstring without use of constructor
                            
                                Qt can't figure out how to thread my return value in my program
                            
                                Switching on scoped enum
                            
                                Split 3D MatND into vector of 2D Mat opencv
                            
                                gcc -Ofast - complete list of limitations
                            
                                Convert jstring to QString
                            
                                Why does the c++ regex_match function require the search string to be defined outside of the function?
                            
                                Mersenne Twister Reproducibility across Compilers [duplicate]
                            
                                Can a transposition table cause search instability
                            
                                Is C++ compiler allowed to optimize out unreferenced local objects

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

What are the different character sets used for?

Tags:

c++

encoding

utf-8

Dietmar Kühl

People also ask

1 Answers

Dietmar Kühl

Recent Activity

Donate For Us