I read a few posts about best practices for strings and character encoding in C++, but I am struggling a bit with finding a general purpose approach that seems to me reasonably simple and correct. Could I ask for comments on the following? I'm inclined to use UTF-8 and UTF-32, and to define something like:
typedef std::string string8;
typedef std::basic_string<uint32_t> string32;
The string8 class would be used for UTF-8, and having a separate type is just a reminder of the encoding. An alternative would be for string8 to be a subclass of std::string and to remove the methods that aren't quite right for UTF-8.
The string32 class would be used for UTF-32 when a fixed character size is desired.
The UTF-8 CPP functions, utf8::utf8to32() and utf8::utf32to8(), or even simpler wrapper functions, would be used to convert between the two.
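For illustration, here is a minimal sketch of what such a wrapper might look like without any library. The name to_utf32 is hypothetical, and string32 is built on char32_t here rather than uint32_t as an assumption, since std::char_traits is only guaranteed for the standard character types. It checks lead and continuation bytes but does not reject overlong forms or surrogates, which a real library such as UTF8-CPP should be relied on for.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

using string8  = std::string;     // assumed to hold UTF-8
using string32 = std::u32string;  // one element per code point (UTF-32)

// Hypothetical helper: decode a UTF-8 string into UTF-32.
// Throws std::invalid_argument on bad lead/continuation bytes or truncation.
// It does NOT reject overlong encodings or surrogate code points.
inline string32 to_utf32(const string8& in) {
    string32 out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Usage: three bytes of UTF-8 ("h" plus a two-byte "é") become two code points.
inline void to_utf32_example() {
    const string32 s = to_utf32("h\xC3\xA9");
    assert(s.size() == 2);
}
```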
UTF-8 and Shift JIS are often used in C byte strings, while UTF-16 is often used in C wide strings when wchar_t is 16 bits.
There are three different Unicode character encodings: UTF-8, UTF-16 and UTF-32.
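As a small illustration of how the three forms differ, the sketch below counts the code units needed for a single code point outside the Basic Multilingual Plane (U+1F600). The function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Code units needed to store U+1F600 in each encoding form.
// (Function names are hypothetical, for illustration only.)
inline std::size_t utf8_units() {
    return std::string("\xF0\x9F\x98\x80").size();  // UTF-8 bytes written out by hand
}
inline std::size_t utf16_units() {
    return std::u16string(u"\U0001F600").size();    // stored as a surrogate pair
}
inline std::size_t utf32_units() {
    return std::u32string(U"\U0001F600").size();    // always one unit per code point
}

// Usage: the same code point takes a different number of units in each form.
inline void units_example() {
    assert(utf8_units() > utf16_units());
    assert(utf16_units() > utf32_units());
}
```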
In languages such as Java, String objects use UTF-16 encoding internally, and that internal encoding cannot be changed. The only way to get a different encoding is to convert the string to a byte[] array.
As far as I know, the standard C char type is 1 byte (at least 8 bits); it is often used for ASCII text, but the standard does not mandate ASCII.
If you plan on just passing strings around and never inspecting them, you can use plain std::string, though it's a poor man's solution.
The issue is that most frameworks, and even the standard, have stupidly (I think) enforced an encoding in memory. I say stupid because the encoding should only matter at the interface, and those encodings are not well suited to in-memory manipulation of the data.
Furthermore, encoding is easy (it's a simple mapping from code points to bytes and back), while the main difficulty is actually manipulating the data.
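To show how mechanical the code point to bytes direction is, here is a rough sketch of a UTF-8 encoder. The names append_utf8 and to_utf8 are hypothetical, not taken from any library.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point to `out`.
// Rejects surrogates and values above U+10FFFF.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF)
        throw std::invalid_argument("surrogate code point");
    if (cp < 0x80) {                      // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {          // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::invalid_argument("code point out of range");
    }
}

// Hypothetical helper: encode a whole UTF-32 string as UTF-8.
inline std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) append_utf8(out, cp);
    return out;
}

// Usage: ASCII stays one byte per character.
inline void to_utf8_example() {
    assert(to_utf8(U"abc") == "abc");
}
```

The whole thing is a handful of shifts and masks, which is the point: the encoding step is the easy part.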
With an 8-bit or 16-bit encoding you run the risk of cutting a character in the middle, because neither std::string nor std::wstring is aware of what a Unicode character is. Worse, even with a 32-bit encoding, there is the risk of separating a character from the diacritics that apply to it, which is also stupid.
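Both problems are easy to reproduce. The sketch below (function names are made up for illustration) cuts a UTF-8 "é" in half with substr, and then splits a decomposed "é" stored in UTF-32 away from its combining accent.

```cpp
#include <cassert>
#include <string>

// "é" as the single code point U+00E9, encoded in UTF-8 (two bytes).
inline std::string precomposed_utf8() { return "\xC3\xA9"; }

// std::string::substr counts bytes, so this returns only the lead byte,
// which on its own is not valid UTF-8.
inline std::string first_byte(const std::string& s) { return s.substr(0, 1); }

// "é" as 'e' followed by U+0301 (combining acute accent) in UTF-32:
// two code points, but one character as far as the reader is concerned.
inline std::u32string decomposed_utf32() { return U"e\u0301"; }

// Taking the first element keeps the base letter and drops the accent.
inline std::u32string first_code_point(const std::u32string& s) {
    return s.substr(0, 1);
}

// Usage: neither container treats the "é" as a single unit.
inline void split_example() {
    assert(precomposed_utf8().size() != 1);
    assert(decomposed_utf32().size() != 1);
}
```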
The support of Unicode in C++ is therefore extremely subpar, as far as the standard is concerned.
If you really wish to manipulate Unicode strings, you need a Unicode-aware container. The usual way is to use the ICU library, though its interface is really C-ish. However, you'll get everything you need to actually work in Unicode with multiple languages.