
Strings and character encoding in C++

I read a few posts about best practices for strings and character encoding in C++, but I am struggling a bit with finding a general purpose approach that seems to me reasonably simple and correct. Could I ask for comments on the following? I'm inclined to use UTF-8 and UTF-32, and to define something like:

typedef std::string string8;
typedef std::basic_string<uint32_t> string32;

The string8 class would be used for UTF-8, and having a separate type is just a reminder of the encoding. An alternative would be for string8 to be a subclass of std::string and to remove the methods that aren't quite right for UTF-8.

The string32 class would be used for UTF-32 when a fixed character size is desired.

The UTF-8 CPP functions, utf8::utf8to32() and utf8::utf32to8(), or even simpler wrapper functions, would be used to convert between the two.
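For illustration, here is a minimal sketch of what such a conversion wrapper could look like if hand-rolled rather than delegated to UTF8-CPP. The names string8, string32 and to_utf32 are hypothetical, and string32 is assumed to be std::u32string (C++11) instead of std::basic_string<uint32_t>, since the standard only guarantees std::char_traits for the built-in character types.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical aliases mirroring the typedefs in the question.
using string8  = std::string;     // holds UTF-8 code units
using string32 = std::u32string;  // holds UTF-32 code points (assumption: C++11)

// Decode UTF-8 into UTF-32. Throws std::invalid_argument on malformed input.
string32 to_utf32(const string8& in) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    string32 out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (len > in.size() - i)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this sketch, to_utf32("\xE2\x82\xAC") (the three UTF-8 bytes of the euro sign) yields a string32 holding the single code point U+20AC. In real code a tested library such as UTF8-CPP is probably the safer choice.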

nassar asked Oct 16 '10 20:10




1 Answer

If you plan on just passing strings around and never inspecting them, you can use plain std::string, though it's a poor man's solution.

The issue is that most frameworks, and even the standard, have stupidly (I think) enforced an encoding in memory. I say stupid because encoding should only matter at the interface, and those encodings are not suited to in-memory manipulation of the data.

Furthermore, encoding is easy (it's a simple transposition from code points to bytes and back), while the main difficulty is actually manipulating the data.
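To give an idea of how mechanical the code point to bytes step is, here is a rough sketch of a UTF-8 encoder. The names append_utf8 and to_utf8 are made up for this example, and std::u32string is assumed as the container of code points.

```cpp
#include <stdexcept>
#include <string>

// Append the UTF-8 encoding of one code point to `out`.
// Throws std::invalid_argument for surrogates and values above U+10FFFF.
void append_utf8(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Encode a whole sequence of code points as UTF-8.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) append_utf8(out, cp);
    return out;
}
```

The whole transformation is a handful of shifts and masks; nothing in it knows anything about what the characters mean, which is where the real work lies.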

With an 8-bit or 16-bit encoding you run the risk of cutting a character in the middle, because neither std::string nor std::wstring is aware of what a Unicode character is. Worse, even with a 32-bit encoding, there is the risk of separating a character from the diacritics that apply to it, which is also stupid.
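A small sketch of both problems, assuming the std::string holds valid UTF-8. The helper names (is_utf8_boundary, truncate_utf8, decomposed_e_acute) are invented for this example and are not part of any library.

```cpp
#include <cstddef>
#include <string>

// True if byte offset `pos` does not fall inside a multi-byte UTF-8 sequence.
// Assumes `s` holds valid UTF-8.
bool is_utf8_boundary(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return pos == s.size();
    // Continuation bytes look like 10xxxxxx; anything else starts a code point.
    return (static_cast<unsigned char>(s[pos]) & 0xC0) != 0x80;
}

// Truncate to at most `max_bytes` bytes without splitting a code point.
// A plain s.substr(0, max_bytes) gives no such guarantee.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (max_bytes >= s.size()) return s;
    std::size_t end = max_bytes;
    while (end > 0 && !is_utf8_boundary(s, end)) --end;
    return s.substr(0, end);
}

// Even in UTF-32, one user-perceived character may span several code points:
// 'e' followed by U+0301 (combining acute accent) has size() == 2, so
// substr(0, 1) would silently drop the accent.
std::u32string decomposed_e_acute() {
    return U"e\u0301";
}
```

For the UTF-8 string "h\xC3\xA9llo", substr(0, 2) returns "h" plus a dangling lead byte, while truncate_utf8 with the same limit backs off to "h". Neither helper handles the combining-accent case; keeping a base character together with its diacritics requires grapheme cluster rules, which is the kind of thing a Unicode library provides.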

The support of Unicode in C++ is therefore extremely subpar, as far as the standard is concerned.

If you really wish to manipulate Unicode strings, you need a Unicode-aware container. The usual way is to use the ICU library, though its interface is really C-ish. However, you'll get everything you need to actually work in Unicode with multiple languages.

Matthieu M. answered Sep 18 '22 09:09
