C++ and UTF8 - Why not just replace ASCII?

In my application I have to constantly convert strings between std::string and std::wstring due to the different APIs involved (Boost, Win32, FFmpeg, etc.). Especially with FFmpeg, strings end up going utf8->utf16->utf8->utf16 just to open a file.

Since UTF-8 is backwards compatible with ASCII, I thought I would consistently store all my strings as UTF-8 in std::string and only convert to std::wstring when I have to call certain unusual functions.

This worked reasonably well; I implemented to_lower, to_upper, and iequals for UTF-8. However, I then hit several dead ends: std::regex and regular string comparisons. To make this usable I would need to implement a custom ustring class based on std::string, with re-implementations of all the corresponding algorithms (including regex).

Basically my conclusion is that UTF-8 is not very good for general usage, and the current std::string/std::wstring situation is a mess.

However, my question is: why aren't the default std::string and "" simply changed to use UTF-8, especially since UTF-8 is backward compatible? Is there possibly some compiler flag which can do this? Of course the STL implementation would need to be adapted accordingly.

I've looked at ICU, but it is not very compatible with APIs that assume basic_string, e.g. no begin()/end()/c_str(), etc.
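For illustration, here is a minimal sketch of the kind of boundary conversion I mean, using C++11's std::wstring_convert (deprecated since C++17) and assuming a 16-bit wchar_t as on Windows:

    #include <codecvt>
    #include <locale>
    #include <string>

    // Keep UTF-8 in std::string internally; convert only at the API boundary.
    // codecvt_utf8_utf16<wchar_t> assumes wchar_t holds UTF-16 code units (Windows).
    std::wstring to_wide(const std::string& utf8)
    {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
        return conv.from_bytes(utf8);
    }

    std::string to_utf8(const std::wstring& wide)
    {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
        return conv.to_bytes(wide);
    }

    // e.g. CreateFileW(to_wide(path).c_str(), ...) at the Win32 boundary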

Asked Dec 06 '11 by ronag
3 Answers

The main issue is the conflation of in-memory representation and encoding.

None of the Unicode encodings is really amenable to text processing. Users generally care about graphemes (what's on the screen), while the encodings are defined in terms of code points... and some graphemes are composed of several code points.

As such, when one asks what the 5th character of "Hélène" (a French first name) is, the question is quite confusing:

  • In terms of graphemes, the answer is n.
  • In terms of code points... it depends on the representation of é and è (they can be represented either as a single code point or as a pair using diacritics...)

Depending on the source of the question (an end-user in front of her screen or an encoding routine), the response is completely different.
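For instance, here is a small sketch of that ambiguity; the UTF-8 byte sequences are written out explicitly so the example does not depend on the source file's encoding:

    #include <cstring>
    #include <iostream>

    int main()
    {
        // "é" as the single precomposed code point U+00E9, spelled as its UTF-8 bytes
        const char precomposed[] = "\xC3\xA9";
        // "é" as 'e' (U+0065) followed by the combining acute accent (U+0301)
        const char decomposed[]  = "e\xCC\x81";

        // Both display as one grapheme, yet they differ in code points and in bytes.
        std::cout << std::strlen(precomposed) << '\n'; // 2 bytes, 1 code point
        std::cout << std::strlen(decomposed)  << '\n'; // 3 bytes, 2 code points
    }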

Therefore, I think that the real question is: why are we speaking about encodings here?

Today it does not make sense; we would need two "views": graphemes and code points.

Unfortunately the std::string and std::wstring interfaces were inherited from a time when people thought that ASCII was sufficient, and the progress made since didn't really solve the issue.

I don't even understand why the in-memory representation should be specified; it is an implementation detail. All a user should want is:

  • to be able to read/write in UTF-* and ASCII
  • to be able to work on graphemes
  • to be able to edit a grapheme (to manage the diacritics)

... who cares how it is represented? I thought that good software was built on encapsulation?

Well, C cares, and we want interoperability... so I guess it will be fixed when C is.
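As a purely hypothetical sketch of what such a grapheme-oriented, representation-agnostic interface could look like (none of these names exist in the standard library or in ICU):

    #include <cstddef>
    #include <string>
    #include <vector>

    // Hypothetical text type: the storage is an unspecified implementation detail;
    // the user only sees decode/encode and grapheme-level access.
    class text
    {
    public:
        static text from_utf8(const std::string& bytes);     // read
        std::string to_utf8() const;                          // write

        std::size_t grapheme_count() const;                   // "Hélène" -> 6
        text grapheme_at(std::size_t i) const;                // grapheme_at(4) -> "n"
        void replace_grapheme(std::size_t i, const text& g);  // edit diacritics etc.

    private:
        std::vector<char32_t> storage_; // one possible representation, kept hidden
    };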

Answered by Matthieu M.

You cannot; the primary reason for this is named Microsoft. They decided not to support Unicode as UTF-8, so the support for UTF-8 under Windows is minimal.

Under Windows you cannot use UTF-8 as a code page, but you can convert to and from UTF-8.
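For example, the Win32 conversion functions do accept CP_UTF8; a minimal sketch with error handling omitted:

    #include <windows.h>
    #include <string>

    // UTF-8 std::string -> UTF-16 std::wstring via the Win32 API
    std::wstring utf8_to_utf16(const std::string& s)
    {
        int len = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), nullptr, 0);
        std::wstring out(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &out[0], len);
        return out;
    }

    // UTF-16 std::wstring -> UTF-8 std::string
    std::string utf16_to_utf8(const std::wstring& s)
    {
        int len = WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(),
                                      nullptr, 0, nullptr, nullptr);
        std::string out(len, '\0');
        WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(),
                            &out[0], len, nullptr, nullptr);
        return out;
    }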

Answered by sorin

There are two snags to using UTF-8 on Windows.

  1. You cannot tell from the character count how many bytes a string will occupy - it depends on which characters are present, since some characters take 1 byte, some take 2, some take 3, and some take 4 (see the short example after this list).

  2. The Windows API uses UTF-16. Since most Windows programs make numerous calls to the Windows API, there is quite an overhead converting back and forth. (Note that you can do a "non-Unicode" build, which looks like it uses an 8-bit, narrow-character Windows API, but all that is happening is that the conversion back and forth on each call is hidden.)
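Regarding the first point, a small example of how the per-character byte count varies (byte sequences written out explicitly):

    #include <cstring>
    #include <iostream>

    int main()
    {
        // Each literal is one "character" to the user, but a different number of UTF-8 bytes.
        std::cout << std::strlen("A")                << '\n'; // 1 byte  (U+0041)
        std::cout << std::strlen("\xC3\xA9")         << '\n'; // 2 bytes (U+00E9, é)
        std::cout << std::strlen("\xE2\x82\xAC")     << '\n'; // 3 bytes (U+20AC, €)
        std::cout << std::strlen("\xF0\x9F\x98\x80") << '\n'; // 4 bytes (U+1F600, emoji)
    }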

The big snag with UTF-16 is that the binary representation of a string depends on the byte order of a word on the particular hardware the program is running on. This does not matter in most cases, except when strings are transmitted between computers, where you cannot be sure that the other computer uses the same byte order.

So what to do? I use UTF-16 everywhere 'inside' all my programs. When string data has to be stored in a file or transmitted over a socket, I first convert it to UTF-8.

This means that 95% of my code runs simply and most efficiently, and all the messy conversions between UTF-8 and UTF-16 can be isolated to the routines responsible for I/O.
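A minimal sketch of that isolation, assuming the internal strings are UTF-16 std::wstring; the function name and the lack of error handling are illustrative only:

    #include <windows.h>
    #include <fstream>
    #include <string>

    // All strings "inside" the program are UTF-16; the conversion happens only here,
    // in the routine responsible for I/O.
    void save_text(const std::wstring& text, const std::string& path)
    {
        int len = WideCharToMultiByte(CP_UTF8, 0, text.data(), (int)text.size(),
                                      nullptr, 0, nullptr, nullptr);
        std::string utf8(len, '\0');
        WideCharToMultiByte(CP_UTF8, 0, text.data(), (int)text.size(),
                            &utf8[0], len, nullptr, nullptr);

        std::ofstream file(path, std::ios::binary); // write the raw UTF-8 bytes
        file.write(utf8.data(), (std::streamsize)utf8.size());
    }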

Answered by ravenspoint