Using Unicode in C++ source code

Tags: c++, character-encoding, unicode, standards

What is the standard encoding of C++ source code? Does the C++ standard even say something about this? Can I write C++ source in Unicode?

For example, can I use non-ASCII characters such as Chinese characters in comments? If so, is full Unicode allowed or just a subset of Unicode? (e.g., that 16-bit first page or whatever it's called.)

Furthermore, can I use Unicode for strings? For example:

wstring str = L"Strange chars: âÂ Čšđ ě €€";


asked Dec 01 '08 18:12

Kresimir Cosic

2 Answers

Encoding in C++ is quite complicated. Here is my understanding of it.

Every implementation has to support characters from the basic source character set. These include the common characters listed in §2.2/1 (§2.3/1 in C++11). These characters should all fit into one char. In addition, implementations have to support a way to name other characters, called universal-character-names. They look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them is usable in identifiers (listed in Annex E).
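As a minimal sketch of the universal-character-name notation (the helper names are made up for illustration, and C++11 or later is assumed for the char32_t literals), the file below contains only basic source characters, yet it names the euro sign and a Chinese character:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: U+20AC EURO SIGN, spelled with the 4-digit form.
std::u32string euro_sign() { return U"\u20AC"; }

// Hypothetical helper: U+4E2D (a CJK ideograph), spelled with the 8-digit form.
std::u32string cjk_sample() { return U"\U00004E2D"; }

// Each UTF-32 literal holds exactly one code unit: the named code point.
void check_ucn_literals() {
    assert(euro_sign().size() == 1 && euro_sign()[0] == char32_t{0x20AC});
    assert(cjk_sample().size() == 1 && cjk_sample()[0] == char32_t{0x4E2D});
}
```

Because nothing outside the basic source character set appears in the file, the result should not depend on how the compiler decodes the physical source file.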

This is all nice, but the mapping from the characters in the file to source characters (used at compile time) is implementation-defined. This mapping constitutes the encoding used. Here is what the standard says literally (C++98 version):

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)
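The equivalence required by the parenthetical can be checked with a small sketch like the one below. It assumes the file is saved as UTF-8 and that the compiler decodes it as UTF-8 (the default for gcc and clang; MSVC may need /utf-8 or a byte order mark). The function names are hypothetical.

```cpp
#include <cassert>
#include <string>

// Compares a euro sign typed directly into the source with the same
// character written as a universal-character-name.
bool literal_matches_ucn() {
    const std::wstring typed   = L"€";       // extended character in the file
    const std::wstring escaped = L"\u20AC";  // same character, named by its code point
    return typed == escaped;
}

// If the compiler decodes the file with the wrong encoding, this fails,
// which is exactly the implementation-defined part of the mapping.
void check_equivalence() {
    assert(literal_matches_ucn());
}
```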

For gcc, you can change the source encoding using the option -finput-charset=charset. Additionally, you can change the execution character set used to represent values at runtime. The options for this are -fexec-charset=charset for char (it defaults to UTF-8) and -fwide-exec-charset=charset for wchar_t (which defaults to either UTF-16 or UTF-32, depending on the size of wchar_t).
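To see the effect of the execution character set, here is a rough sketch that counts how many units the euro sign occupies in narrow and wide literals. The function names are invented for the example, and the numbers in the comments assume the default UTF-8 execution character set.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of char units the execution character set needs for U+20AC.
// With the default UTF-8 setting this is 3 (bytes E2 82 AC).
std::size_t narrow_euro_length() {
    return std::string("\u20AC").size();
}

// Number of wchar_t units for U+20AC. This is 1 for both UTF-16 and
// UTF-32, since the euro sign lies in the Basic Multilingual Plane.
std::size_t wide_euro_length() {
    return std::wstring(L"\u20AC").size();
}

// Holds only when the narrow execution character set is UTF-8.
void check_default_exec_charset() {
    assert(narrow_euro_length() == 3);
    assert(wide_euro_length() == 1);
}
```

Compiling the same file with a single-byte execution character set that contains the euro sign (for example -fexec-charset=ISO-8859-15, assuming the local iconv supports that name) should make the narrow length 1 instead of 3.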

answered Sep 19 '22 06:09

Johannes Schaub - litb

The C++ standard doesn't say anything about source-code file encoding, so far as I know.

The usual encoding is (or used to be) 7-bit ASCII -- some compilers (Borland's, for instance) would balk at 8-bit characters that set the high bit. There's no technical reason that Unicode characters can't be used, if your compiler and editor accept them -- most modern Linux-based tools, and many of the better Windows-based editors, handle UTF-8 encoding with no problem, though I'm not sure that Microsoft's compiler will.

EDIT: It looks like Microsoft's compilers will accept Unicode-encoded files, but will sometimes produce warnings on 8-bit (extended ASCII) files too:

warning C4819: The file contains a character that cannot be represented in the current code page (932). Save the file in Unicode format to prevent data loss.

answered Sep 19 '22 06:09

Head Geek
