
C++ strings: UTF-8 or 16-bit encoding?

I'm still trying to decide whether my (home) project should use UTF-8 strings (implemented in terms of std::string with additional UTF-8-specific functions when necessary) or some 16-bit string (implemented as std::wstring). The project is a programming language and environment (like VB, it's a combination of both).

There are a few wishes/constraints:

  • It would be cool if it could run on limited hardware, such as computers with limited memory.
  • I want the code to run on Windows, Mac and (if resources allow) Linux.
  • I'll be using wxWidgets as my GUI layer, but I want the code that interacts with that toolkit confined in a corner of the codebase (I will have non-GUI executables).
  • I would like to avoid juggling two different kinds of strings, one for user-visible text and another for the application's data.

Currently, I'm working with std::string, with the intent of using UTF-8 manipulation functions only when necessary. It requires less memory, and seems to be the direction many applications are going anyway.

If you recommend a 16-bit encoding, which one: UTF-16? UCS-2? Another one?

Asked Sep 19 '08 by Carl Seleborg

1 Answer

UTF-16 is still a variable-length encoding (there are more than 2^16 Unicode code points, so anything outside the Basic Multilingual Plane needs a surrogate pair), so you can't do O(1) string indexing operations. If you're doing lots of that sort of thing, you're not saving anything in speed over UTF-8. On the other hand, if your text includes a lot of code points in the U+0800 to U+FFFF range (most CJK text, for example), UTF-16 can be a substantial improvement in size, since those take two bytes each in UTF-16 but three in UTF-8. UCS-2 is the fixed-length predecessor of UTF-16; it regains O(1) indexing at the cost of prohibiting any code point at or above 2^16.

Without knowing more about your requirements, I would personally go for UTF-8. It's the easiest to deal with for all the reasons others have already listed.

Answered Sep 19 '22 by Nick Johnson