
C++ strings: UTF-8 or 16-bit encoding?

I'm still trying to decide whether my (home) project should use UTF-8 strings (implemented in terms of std::string with additional UTF-8-specific functions when necessary) or some 16-bit string (implemented as std::wstring). The project is a programming language and environment (like VB, it's a combination of both).

There are a few wishes/constraints:

  • It would be cool if it could run on limited hardware, such as computers with limited memory.
  • I want the code to run on Windows, Mac and (if resources allow) Linux.
  • I'll be using wxWidgets as my GUI layer, but I want the code that interacts with that toolkit confined in a corner of the codebase (I will have non-GUI executables).
  • I would like to avoid juggling two different kinds of strings, one for user-visible text and another for the application's data.

Currently, I'm working with std::string, with the intent of using UTF-8 manipulation functions only when necessary. It requires less memory, and seems to be the direction many applications are going anyway.

If you recommend a 16-bit encoding, which one: UTF-16? UCS-2? Another one?

Asked Sep 19 '08 by Carl Seleborg

1 Answer

UTF-16 is still a variable-length encoding (there are more than 2^16 Unicode code points, so anything outside the Basic Multilingual Plane needs a surrogate pair), so you can't do O(1) string indexing operations. If you're doing lots of that sort of thing, you're not saving anything in speed over UTF-8. On the other hand, if your text includes a lot of code points in the U+0800 to U+FFFF range (most CJK text, for example), UTF-16 can be a substantial improvement in size, since those take two bytes each in UTF-16 but three in UTF-8. UCS-2 is the fixed-length predecessor of UTF-16; it regains O(1) indexing at the cost of prohibiting any code point at or above 2^16.

Without knowing more about your requirements, I would personally go for UTF-8. It's the easiest to deal with for all the reasons others have already listed.

Answered Sep 19 '22 by Nick Johnson