Short version:
If I want to write a program that can efficiently perform operations on Unicode characters, and that can read and write files in UTF-8 or UTF-16 encodings, what is the appropriate way to do this with C++?
Long version:
C++ predates Unicode, and both have evolved significantly since. I need to know how to write standards-compliant, leak-free C++ code. I need clear answers to:
Which string container should I pick?
- std::string with UTF-8?
- std::wstring (don't really know much about it)
- std::u16string with UTF-16?
- std::u32string with UTF-32?
Should I stick entirely to one of the above containers or change them when needed?
Can I use non-English characters in string literals when using UTF strings, such as Polish characters: ąćęłńśźż etc.?
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
What happens when I do the following?
std::string s = u8"foo";
s += 'x';
What are the differences between wchar_t and other multi-byte character types? Is a wchar_t character or a wchar_t string literal capable of storing UTF encodings?
The Unicode Standard is the universal character-encoding scheme for written characters and text. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software.
The Unicode Transformation Format (UTF) is a character encoding format able to encode all of the possible code points in Unicode. The most widely used is UTF-8, a variable-length encoding that uses 8-bit code units and is designed for backwards compatibility with ASCII.
UTF-16 is better where ASCII is not predominant, since it primarily uses 2 bytes per character. UTF-8 starts to use 3 or more bytes for higher code points, where UTF-16 stays at just 2 bytes for most characters. UTF-32 covers every possible character in 4 bytes.
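To make that concrete, here is a minimal sketch of my own showing the same codepoint stored in each encoding (the char8_t/std::u8string line requires C++20):

#include <string>

int main()
{
    // The same codepoint U+0107 ('ć') in the three UTF encodings:
    std::u8string  a = u8"\u0107"; // 2 code units of 1 byte each (UTF-8); needs C++20
    std::u16string b = u"\u0107";  // 1 code unit of 2 bytes (UTF-16)
    std::u32string c = U"\u0107";  // 1 code unit of 4 bytes (UTF-32)

    return (a.size() == 2 && b.size() == 1 && c.size() == 1) ? 0 : 1;
}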
Which string container should I pick?
That is really up to you to decide, based on your own particular needs. Any of the choices you have presented will work, and they each have their own advantages and disadvantages. Generally, UTF-8 is good for storage and communication purposes, and is backwards compatible with ASCII, whereas UTF-16/32 is easier to use when processing Unicode data.
std::wstring (don't really know much about it)
The size of wchar_t is compiler-dependent and even platform-dependent. For instance, on Windows, wchar_t is 2 bytes, making std::wstring usable for UTF-16 encoded strings. On other platforms, wchar_t may be 4 bytes instead, making std::wstring usable for UTF-32 encoded strings instead. That is why wchar_t/std::wstring is generally not used in portable code, and why char16_t/std::u16string and char32_t/std::u32string were introduced in C++11. Even char can have portability issues for UTF-8, since char can be either signed or unsigned at the discretion of the compiler vendor, which is why char8_t/std::u8string was introduced in C++20 for UTF-8.
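You can check the code unit sizes on your own toolchain with a quick sketch like this (the results in the comment are typical values, not guarantees):

#include <iostream>

int main()
{
    // Typical output: "2 2 4" on Windows (UTF-16 wchar_t),
    // "4 2 4" on most Linux/macOS toolchains (UTF-32 wchar_t).
    std::cout << sizeof(wchar_t)  << ' '
              << sizeof(char16_t) << ' '
              << sizeof(char32_t) << '\n';
}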
Should I stick entirely to one of the above containers or change them when needed?
Use whatever containers suit your needs.
Typically, you should use one string type throughout your code. Perform data conversions only at the boundaries where string data enters/leaves your program. For instance, when reading/writing files, network communications, platform system calls, etc.
How to properly convert between them?
There are many ways to handle that.
C++11 and later have std::wstring_convert/std::wbuffer_convert, but these were deprecated in C++17.
There are 3rd party Unicode conversion libraries, such as ICONV, ICU, etc.
There are C library functions, platform system calls, etc.
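As a rough illustration only (std::wstring_convert is deprecated since C++17, so prefer ICU, iconv, or platform APIs in production code; the byte string below is just an example spelling "zażółć" in UTF-8), the old C++11 facility looks like this:

#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // Deprecated in C++17; shown only to illustrate the old standard facility.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

    // UTF-8 bytes -> UTF-16 code units
    std::u16string utf16 = conv.from_bytes("za\xC5\xBC\xC3\xB3\xC5\x82\xC4\x87");

    // UTF-16 code units -> UTF-8 bytes
    std::string utf8 = conv.to_bytes(utf16);
}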
Can I use non-English characters in string literals, when using UTF strings, such as Polish characters: ąćęłńśźż etc.?
Yes, if you use the appropriate string literal prefixes:
- u8 for UTF-8.
- L for UTF-16 or UTF-32 (depending on compiler/platform).
- u for UTF-16.
- U for UTF-32.
Also, be aware that the charset you use to save your source files can affect how the compiler interprets string literals. So make sure that whatever charset you choose to save your files in, like UTF-8, that you tell your compiler what that charset is, or else you may end up with the wrong string values at runtime.
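For example, a short sketch of the four prefixes (this assumes the source file is saved as UTF-8 and the compiler is told so, as noted above; also note that in C++20 the u8 prefix yields const char8_t* rather than const char*):

int main()
{
    auto utf8  = u8"ąćęłńśźż";  // UTF-8 (const char8_t* in C++20)
    auto wide  = L"ąćęłńśźż";   // UTF-16 or UTF-32, platform-dependent
    auto utf16 = u"ąćęłńśźż";   // UTF-16
    auto utf32 = U"ąćęłńśźż";   // UTF-32
    (void)utf8; (void)wide; (void)utf16; (void)utf32;
}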
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
Each string character may be a single byte, or be part of a multi-byte representation of a Unicode codepoint. It depends on the encoding of the string and the character being encoded. Similarly, std::wstring (when wchar_t is 2 bytes) and std::u16string can hold strings containing supplementary characters outside of the Unicode BMP, which require UTF-16 surrogate pairs to encode.
When a string container contains a UTF encoded string, each "character" is just a UTF code unit. UTF-8 encodes a Unicode codepoint as 1-4 code units (1-4 chars in a std::string). UTF-16 encodes a codepoint as 1-2 code units (1-2 wchar_ts/char16_ts in a std::wstring/std::u16string). UTF-32 encodes a codepoint as 1 code unit (1 char32_t in a std::u32string).
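A small sketch to make this concrete (the escaped bytes below spell "zażółć" in UTF-8):

#include <iostream>
#include <string>

int main()
{
    // "zażółć": 6 Unicode codepoints, but 10 UTF-8 code units (bytes),
    // because each Polish letter takes 2 bytes in UTF-8.
    std::string s = "za\xC5\xBC\xC3\xB3\xC5\x82\xC4\x87";
    std::cout << s.size() << '\n'; // prints 10, the number of char code units
}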
What happens when I do the following?
std::string s = u8"foo"; s += 'x';
Exactly what you would expect. A std::string holds char elements. Regardless of encoding, operator+=(char) will simply append a single char to the end of the std::string.
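A minimal sketch of the result (assuming the pre-C++20 behaviour of the u8 literal, matching the question's snippet):

#include <cassert>
#include <string>

int main()
{
    std::string s = u8"foo"; // "foo" is pure ASCII, so its UTF-8 bytes are identical
    s += 'x';                // appends exactly one char code unit

    assert(s == "foox");
    assert(s.size() == 4);
}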
How can I distinguish UTF char[] and non-UTF char[] or std::string?
You would need to have outside knowledge of the string's original encoding, or else perform your own heuristic analysis of the char[]/std::string data to see if it conforms to a UTF or not.
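For instance, here is a rough heuristic sketch of my own (not from the answer above): it checks whether a byte sequence is structurally valid UTF-8. Passing the check is strong evidence of UTF-8, but plain ASCII also passes, and it does not reject overlong encodings or surrogate codepoints, so it cannot prove intent.

#include <cstddef>
#include <string>

// Rough structural check: does the byte sequence parse as UTF-8?
bool looks_like_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        if      (c < 0x80)           len = 1; // ASCII byte
        else if ((c & 0xE0) == 0xC0) len = 2; // 2-byte sequence lead
        else if ((c & 0xF0) == 0xE0) len = 3; // 3-byte sequence lead
        else if ((c & 0xF8) == 0xF0) len = 4; // 4-byte sequence lead
        else                         return false; // invalid lead byte

        if (i + len > s.size())
            return false; // truncated sequence

        for (std::size_t j = 1; j < len; ++j) // continuation bytes must be 10xxxxxx
            if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80)
                return false;

        i += len;
    }
    return true;
}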
What are the differences between wchar_t and other multi-byte character types?
Byte size and UTF encoding.
- char = ANSI/MBCS or UTF-8
- wchar_t = DBCS, UTF-16 or UTF-32, depending on compiler/platform
- char8_t = UTF-8
- char16_t = UTF-16
- char32_t = UTF-32
Is a wchar_t character or a wchar_t string literal capable of storing UTF encodings?
Yes, UTF-16 or UTF-32, depending on compiler/platform. In the case of UTF-16, a single wchar_t can only hold a codepoint value that is in the BMP. A single wchar_t in UTF-32 can hold any codepoint value. A wchar_t string can encode all codepoints in either encoding.
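For example (the sizes in the comment are typical platform values, not guarantees):

#include <iostream>
#include <string>

int main()
{
    // U+1F600 lies outside the BMP. Where wchar_t is 2 bytes (UTF-16) the
    // literal needs a surrogate pair, i.e. 2 wchar_t code units; where
    // wchar_t is 4 bytes (UTF-32) it is a single code unit.
    std::wstring w = L"\U0001F600";
    std::cout << w.size() << '\n'; // typically 2 on Windows, 1 on Linux/macOS
}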
How to properly manipulate UTF strings (such as toupper/tolower conversion) and be compatible with locales simultaneously?
That is a very broad topic, worthy of its own separate question by itself.