
C++: Making my project support unicode

Tags:

c++

unicode

My C++ project is currently about 16K lines of code, and I admit I did not think about Unicode support at all when I started.

All I did was add a custom typedef of std::string as String and jump into coding.

I have never really worked with Unicode in my own programs.

  • How hard is it to switch my project to Unicode now? Is it even a good idea?

  • Can I just switch to std::wchar without any major problems?

Kxs asked Mar 13 '11 11:03


3 Answers

Probably the most important part of making an application Unicode-aware is tracking the encoding of your strings and making sure that your public interfaces are well specified and easy to use with the encodings you want to support.

Switching to a wider character type (wchar_t in C++) is not necessarily the correct solution. In fact, I would say it is usually not the simplest solution. Some applications can get away with specifying that all strings and interfaces use UTF-8, and then need no changes at all. std::string can perfectly well be used for UTF-8 encoded strings.

However, if you need to interpret the characters in a string, or to interface with non-UTF-8 interfaces, then you will have to put in more work. Without knowing more about your application, it is impossible to recommend a single best approach.
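
To illustrate what "interpreting the characters" means in practice: a std::string holding UTF-8 reports its length in bytes, not in characters. The following is a minimal sketch, assuming the input is valid UTF-8. The name utf8_code_point_count is a hypothetical helper made up for this example, not a standard library function.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the standard library): counts Unicode
// code points in a string that is ASSUMED to contain valid UTF-8.
// UTF-8 continuation bytes have the bit pattern 10xxxxxx, so every byte
// that does not match that pattern starts a new code point.
std::size_t utf8_code_point_count(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For example, the UTF-8 encoding of "naïve" occupies 6 bytes in a std::string, while this function counts 5 code points. Calls such as s.size(), s.substr() and s[i] keep working on bytes, which is fine for storing and passing text through but not for per-character logic.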

CB Bailey answered Oct 16 '22 12:10


There are some issues with using std::wstring. If your application stores text in Unicode and runs on different platforms, you may run into trouble. std::wstring relies on wchar_t, whose size is compiler dependent. In Microsoft Visual C++, this type is 16 bits wide, so wide strings there are UTF-16. GCC on Linux specifies this type to be 32 bits wide, so wide strings there are UTF-32. If you write wide-character text to a file on one system (say Windows/VC++) and then read the file on another system (Linux/GCC), you will have to prepare for this (in this case, convert from UTF-16 to UTF-32).
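
The UTF-16 to UTF-32 conversion mentioned above can be sketched roughly as follows. This is only an illustration under some assumptions: it uses the C++11 types std::u16string and std::u32string instead of wchar_t so that the code unit width is the same on every compiler, the function name utf16_to_utf32 is made up for this example, and replacing unpaired surrogates with U+FFFD is a policy picked for the sketch, not a requirement.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-16 code units to UTF-32 code points.
// Unpaired surrogates are replaced with U+FFFD (a policy chosen for this
// sketch; a real application might prefer to report an error instead).
std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t unit = in[i];
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                char32_t low = in[i + 1];
                out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
                ++i;  // the low surrogate has been consumed as well
            } else {
                out.push_back(0xFFFD);
            }
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            // Low surrogate without a preceding high surrogate.
            out.push_back(0xFFFD);
        } else {
            // Code point in the Basic Multilingual Plane: copy as-is.
            out.push_back(unit);
        }
    }
    return out;
}
```

When reading a file, the bytes would first have to be assembled into 16-bit code units with the correct byte order, which this sketch does not cover.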

Jörgen Sigvardsson answered Oct 16 '22 10:10


Can I just switch to [std::wchar_t] without any major problems?

No, it's not that simple.

  • The encoding of a wchar_t string is platform-dependent. Windows uses UTF-16. Linux usually uses UTF-32. (C++0x will mitigate this difference by introducing separate char16_t and char32_t types.)
  • If you need to support Unix-like systems, you don't have all the UTF-16 functions that Windows has, so you'd need to write your own _wfopen, etc.
  • Do you use any third-party libraries? Do they support wchar_t?
  • Although wide characters are commonly used for an in-memory representation, on-disk and on-the-Web formats are much more likely to be UTF-8 (or another char-based encoding) than UTF-16/32. You'd have to convert these.
  • You can't just search-and-replace char with wchar_t because C++ confounds "character" and "byte", and you have to determine which chars are characters and which chars are bytes.
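
As an illustration of the conversion step in the fourth bullet, here is a rough sketch of encoding an in-memory UTF-32 string as UTF-8 before writing it to disk or sending it over the network. It assumes the C++11 type std::u32string for the in-memory form. The function name utf32_to_utf8 is made up for this example, and substituting U+FFFD for invalid code points is a choice made for the sketch.

```cpp
#include <string>

// Hypothetical helper: encodes UTF-32 code points as UTF-8 bytes.
// Surrogates and values above U+10FFFF are not valid code points, so they
// are replaced with U+FFFD here (a policy chosen for this sketch).
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result is a std::string used as a byte container, which is also a small example of the last bullet: here the chars are bytes of an encoding, not characters.
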
dan04 answered Oct 16 '22 11:10