
How to handle unicode character sequences in C/C++?

What are the more portable and clean ways to handle unicode character sequences in C and C++ ?

Moreover, how to:

- Read unicode strings
- Convert unicode strings to ASCII to save some bytes (if the user only inputs ASCII)
- Print unicode strings

Should I use the environment too ? I've read about LC_CTYPE for example, should I care about it as a developer ?

aksh asked Sep 02 '10 03:09


2 Answers

What are the more portable and clean ways to handle unicode character sequences in C and C++ ?

Have all strings in your program be UTF-8, UTF-16, or UTF-32. If for some reason you need to work with a non-Unicode encoding, do the conversion on input and output.
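As an illustration of converting at the input boundary, here is a minimal sketch that turns ISO-8859-1 bytes into UTF-8. The function name `latin1_to_utf8` is made up for this example, and it assumes C++11 or later. It works because every ISO-8859-1 byte value equals its Unicode code point; windows-1252 differs in the 0x80 to 0x9F range and would need a lookup table instead.

```cpp
#include <string>

// Hypothetical helper: convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Each Latin-1 byte maps to the code point with the same value,
// so bytes >= 0x80 become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));                   // ASCII: unchanged
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));     // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));   // continuation byte
        }
    }
    return out;
}
```

The reverse direction (UTF-8 to a legacy encoding on output) can fail for characters the target encoding lacks, so it needs an error or replacement policy.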

Read unicode strings

Same way you'd read an ASCII file. But there's still a lot of non-Unicode data around, so you'll want to check whether the data is Unicode. If it's not (or if it's UTF-8 when your preferred internal encoding is UTF-32), you'll need to convert it.

  • UTF-8 and UTF-32 can be reliably detected by validation.
  • UTF-16 can be detected by the presence of a BOM.
  • If it's not a UTF encoding, it's likely in ISO-8859-1 or windows-1252.
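The detection steps above could be sketched roughly as follows. `detect_bom` and `is_valid_utf8` are names invented for this example (assuming C++11 or later); the validator rejects overlong forms, surrogates, and values above U+10FFFF, which is what makes UTF-8 detection by validation reasonably reliable.

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Hypothetical helper: identify a byte order mark at the start of the data.
// UTF-32 is checked first because FF FE 00 00 also begins with the UTF-16LE BOM
// (so UTF-16LE text starting with U+0000 would be misreported; that case is rare).
Bom detect_bom(const std::string& s) {
    auto b = [&s](std::size_t i) { return static_cast<unsigned char>(s[i]); };
    const std::size_t n = s.size();
    if (n >= 4 && b(0) == 0xFF && b(1) == 0xFE && b(2) == 0x00 && b(3) == 0x00) return Bom::Utf32LE;
    if (n >= 4 && b(0) == 0x00 && b(1) == 0x00 && b(2) == 0xFE && b(3) == 0xFF) return Bom::Utf32BE;
    if (n >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF) return Bom::Utf8;
    if (n >= 2 && b(0) == 0xFF && b(1) == 0xFE) return Bom::Utf16LE;
    if (n >= 2 && b(0) == 0xFE && b(1) == 0xFF) return Bom::Utf16BE;
    return Bom::None;
}

// Hypothetical helper: true if the whole string is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min = 0;  // smallest code point allowed for this length
        if (c < 0x80) { ++i; continue; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;                       // stray continuation or invalid lead byte
        if (n - i < len) return false;           // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}
```

Pure ASCII input passes the UTF-8 check too, which is harmless since ASCII is a subset of UTF-8.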

Convert unicode strings to ASCII to save some bytes (if the user only inputs ASCII)

Don't. If your data is all ASCII, then UTF-8 will take exactly the same amount of space. And if it isn't, you'll lose information when you convert to ASCII. If you care about saving bytes:

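  • Choose the optimal UTF encoding. For characters U+0000 to U+007F, UTF-8 is the smallest (1 byte versus 2 in UTF-16). For U+0080 to U+07FF, both take 2 bytes. For characters U+0800 to U+FFFF, UTF-16 is the smallest (2 bytes versus 3 in UTF-8). Above U+FFFF, both take 4 bytes.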
  • Use data compression like gzip. There is a SCSU encoding specifically designed for Unicode, but I don't know how good it is.
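To make the size comparison concrete, here is a small sketch that computes how many bytes a single code point takes in each encoding. The names `utf8_bytes` and `utf16_bytes` are invented for this example, and it assumes C++11 (for `char32_t`) and a valid code point as input.

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed to encode one code point in UTF-8.
std::size_t utf8_bytes(char32_t cp) {
    if (cp < 0x80) return 1;      // U+0000..U+007F
    if (cp < 0x800) return 2;     // U+0080..U+07FF
    if (cp < 0x10000) return 3;   // U+0800..U+FFFF
    return 4;                     // U+10000..U+10FFFF
}

// Hypothetical helper: bytes needed to encode one code point in UTF-16.
std::size_t utf16_bytes(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;  // one code unit, or a surrogate pair
}
```

Summing these over a representative sample of your text is one way to pick between the two encodings before reaching for compression.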

Print unicode strings

Writing UTF-8 is no different from writing ASCII, except at the Windows command prompt, which still uses the old "OEM" code pages. There you can use WriteConsoleW with UTF-16 strings.
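A rough sketch of how that split might look in practice: plain `fwrite` everywhere, with a `WriteConsoleW` path when stdout is a real Windows console. `print_utf8` is a name made up for this example; the Windows branch assumes the input is valid UTF-8 and has not been exercised here beyond the documented Win32 signatures.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#include <vector>
#endif

// Hypothetical helper: write a UTF-8 string to standard output.
// Returns true if everything was written.
bool print_utf8(const std::string& s) {
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    // GetConsoleMode succeeds only when stdout is an actual console,
    // not a file or pipe; redirected output falls through to fwrite.
    if (GetConsoleMode(out, &mode)) {
        int wlen = MultiByteToWideChar(CP_UTF8, 0, s.data(),
                                       static_cast<int>(s.size()), nullptr, 0);
        if (wlen <= 0) return s.empty();
        std::vector<wchar_t> wide(static_cast<std::size_t>(wlen));
        MultiByteToWideChar(CP_UTF8, 0, s.data(),
                            static_cast<int>(s.size()), wide.data(), wlen);
        DWORD written = 0;
        return WriteConsoleW(out, wide.data(), static_cast<DWORD>(wlen),
                             &written, nullptr) != 0;
    }
#endif
    // Everywhere else, UTF-8 bytes are written exactly like ASCII.
    return std::fwrite(s.data(), 1, s.size(), stdout) == s.size();
}
```

On other platforms the terminal still has to be configured for UTF-8 for the output to display correctly.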

Should I use the environment too ? I've read about LC_CTYPE for example, should I care about it as a developer ?

LC_CTYPE is a holdover from the days when every language had its own character encoding, and thus its own ctype.h functions. Today, the Unicode Character Database takes care of that. The beauty of Unicode is that it separates character encoding handling from locale handling (except for the special uppercase/lowercase rules for Lithuanian, Turkish, and Azeri).

But each language still has its own collation rules and number formatting rules, so you'll still need locales for those. And you'll need to set your locale's character encoding to UTF-8.
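For the collation part, a minimal sketch using only the standard library might look like this. `sort_for_locale` and `use_environment_locale` are names invented for this example (assuming C++11 or later); `std::strcoll` compares according to the current `LC_COLLATE` setting, so the ordering depends on which locales are installed on the machine.

```cpp
#include <algorithm>
#include <clocale>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: adopt the locale from the user's environment
// (LANG, LC_ALL, ...). Returns false if that locale is not available.
bool use_environment_locale() {
    return std::setlocale(LC_ALL, "") != nullptr;
}

// Hypothetical helper: sort strings using the current locale's collation rules.
void sort_for_locale(std::vector<std::string>& words) {
    std::sort(words.begin(), words.end(),
              [](const std::string& a, const std::string& b) {
                  return std::strcoll(a.c_str(), b.c_str()) < 0;
              });
}
```

In the default "C" locale this behaves like a plain byte-wise comparison; the language-specific ordering only appears after a suitable locale has been selected.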

dan04 answered Nov 04 '22 07:11


What are the more portable and clean ways to handle unicode character sequences in C and C++ ?

Use a library like ICU. If you can't (that is, abso-freaking-lutely can't), roll your own. Be prepared to have a Hard Time though. Also, do look up the sample source code in the Unicode.org documentation.

Should I use the environment too ?

Yes. You will probably need to use the std::setlocale function as well. This allows you to set a locale corresponding to the encoding you want. For example, if you want British English as the language and UTF-8 as the encoding, you'd set LC_CTYPE to en_GB.UTF-8 (locale names are platform-specific, so check what your system provides).
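Here is a minimal sketch of that idea, assuming a POSIX-style system where names such as `en_GB.UTF-8` or `C.UTF-8` may be installed. The helper names `select_utf8_ctype` and `widen` are made up for this example; `std::mbstowcs` decodes according to whatever `LC_CTYPE` is currently active.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: try a few UTF-8 locale names for LC_CTYPE.
// The candidate list is an assumption; available names vary by system.
bool select_utf8_ctype() {
    const char* candidates[] = {"en_GB.UTF-8", "C.UTF-8", "en_US.UTF-8"};
    for (const char* name : candidates) {
        if (std::setlocale(LC_CTYPE, name) != nullptr) return true;
    }
    return false;
}

// Hypothetical helper: decode a multibyte string using the current LC_CTYPE.
// Returns an empty string if the input is invalid in that encoding.
std::wstring widen(const std::string& s) {
    const std::size_t n = std::mbstowcs(nullptr, s.c_str(), 0);
    if (n == static_cast<std::size_t>(-1)) return std::wstring();
    std::wstring out(n, L'\0');
    std::mbstowcs(&out[0], s.c_str(), n);
    return out;
}
```

If none of the candidate locales exists, `select_utf8_ctype` returns false and the program stays in whatever locale was active before.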

C++03 does not give you a way to deal with Unicode. Your best bet is to use the wchar_t data type (and by extension std::wstring). However, note that its size and character encoding differ between operating systems. For example, Windows uses 2 bytes for wchar_t with UTF-16 encoding, whereas GNU/Linux and Mac OS X use 4 bytes and UTF-32.
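One consequence of the size difference is that code storing a code point in a `std::wstring` has to branch on `sizeof(wchar_t)`. A small sketch, with `to_wide` as a name invented for this example and `char32_t` assuming C++11:

```cpp
#include <string>

// Hypothetical helper: store one code point in a std::wstring.
// With a 2-byte wchar_t (UTF-16, as on Windows), code points above U+FFFF
// need a surrogate pair; with a 4-byte wchar_t one element is enough.
std::wstring to_wide(char32_t cp) {
    std::wstring out;
    if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
        cp -= 0x10000;
        out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    } else {
        out.push_back(static_cast<wchar_t>(cp));
    }
    return out;
}
```

This also means `std::wstring::size()` counts code units, not characters, on platforms where wchar_t is 2 bytes.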

C++0x is supposed to amend the situation by adding Unicode literals, codecvt facets, C Unicode TR support (read <uchar.h>), etc., but that is still a long way off for most compilers. (There are a few questions here on SO that ought to help you get started.)
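For a compiler that does implement these features (standardized as C++11), the Unicode literals look roughly like this. The function names are invented for this example; the text is written with universal character names so the result does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// The same two characters, U+00E9 and U+1F600, in each literal form.

// UTF-16 literal: U+1F600 needs a surrogate pair, so 1 + 2 code units.
std::size_t utf16_units() {
    std::u16string s = u"\u00E9\U0001F600";
    return s.size();
}

// UTF-32 literal: one code unit per code point.
std::size_t utf32_units() {
    std::u32string s = U"\u00E9\U0001F600";
    return s.size();
}

// UTF-8 literal: 2 bytes + 4 bytes. sizeof includes the terminating NUL,
// hence the -1. (The element type is char before C++20 and char8_t from C++20,
// both one byte wide.)
std::size_t utf8_literal_bytes() {
    return sizeof(u8"\u00E9\U0001F600") - 1;
}
```

Unlike wchar_t, the widths of char16_t and char32_t do not change between platforms.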

dirkgently answered Nov 04 '22 06:11