What are the more portable and clean ways to handle Unicode character sequences in C and C++?
Moreover, how to:
- Read Unicode strings
- Convert Unicode strings to ASCII to save some bytes (if the user only inputs ASCII)
- Print Unicode strings
Should I use the environment too? I've read about LC_CTYPE for example; should I care about it as a developer?
What are the more portable and clean ways to handle Unicode character sequences in C and C++?
Have all strings in your program be UTF-8, UTF-16, or UTF-32. If for some reason you need to work with a non-Unicode encoding, do the conversion on input and output.
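As a rough illustration of "convert on input", the sketch below turns ISO-8859-1 (Latin-1) bytes into UTF-8. Latin-1 is only an assumed example of a legacy encoding, and latin1_to_utf8 is a hypothetical helper name; for other encodings you would want a real conversion library rather than hand-written code.

```cpp
#include <string>

// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8.
// This works because Latin-1 byte values equal the Unicode code points
// U+0000..U+00FF, so each byte becomes either 1 or 2 UTF-8 bytes.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        if (b < 0x80) {
            out += static_cast<char>(b);                  // ASCII: copied unchanged
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));    // lead byte
            out += static_cast<char>(0x80 | (b & 0x3F));  // continuation byte
        }
    }
    return out;
}
```

ASCII input passes through byte for byte, which is why an all-UTF-8 program costs nothing extra for ASCII-only data.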
Read Unicode strings
Same way you'd read an ASCII file. But there's still a lot of non-Unicode data around, so you'll want to check whether the data is Unicode. If it's not (or if it's UTF-8 when your preferred internal encoding is UTF-32), you'll need to convert it.
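One way to "check whether the data is Unicode" is to test whether the bytes form well-formed UTF-8. The function below is a minimal sketch of that idea, not a full encoding detector; is_valid_utf8 is a hypothetical name, and a byte sequence that passes could in principle still be some other encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Rejects truncated sequences, stray continuation bytes, overlong forms,
// surrogate code points, and values above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        unsigned char b = p[i];
        if (b < 0x80) { ++i; continue; }            // plain ASCII byte
        std::size_t len;
        unsigned long cp;
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;                          // invalid lead byte
        if (i + len > n) return false;              // sequence cut short
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000)) return false;   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        i += len;
    }
    return true;
}
```

Text in a legacy 8-bit encoding that contains non-ASCII characters usually fails this check, which makes it a reasonable first test before deciding to convert.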
Convert Unicode strings to ASCII to save some bytes (if the user only inputs ASCII)
Don't. If your data is all ASCII, then UTF-8 will take exactly the same amount of space, so converting saves nothing even if you care about bytes. And if it isn't, you'll lose information when you convert to ASCII.
Print Unicode strings
Writing UTF-8 is no different from writing ASCII.
Except at the Windows command prompt, because it still uses the old "OEM" code pages. There you can use WriteConsoleW with UTF-16 strings.
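The two cases above could be wrapped in one function, sketched here under some assumptions: print_utf8 is a hypothetical helper, the input is assumed to be valid UTF-8, and the Windows branch converts to UTF-16 with MultiByteToWideChar before calling WriteConsoleW. When output is redirected to a file or pipe, it falls back to writing the raw bytes.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: writes UTF-8 text to standard output.
// Returns true if everything was written.
bool print_utf8(const std::string& s) {
    if (s.empty()) return true;
#ifdef _WIN32
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (GetConsoleMode(h, &mode)) {  // succeeds only for a real console
        // First call asks how many UTF-16 units are needed.
        int n = MultiByteToWideChar(CP_UTF8, 0, s.data(),
                                    static_cast<int>(s.size()), NULL, 0);
        if (n <= 0) return false;
        std::wstring w(static_cast<std::size_t>(n), L'\0');
        MultiByteToWideChar(CP_UTF8, 0, s.data(),
                            static_cast<int>(s.size()), &w[0], n);
        DWORD written = 0;
        return WriteConsoleW(h, w.data(), static_cast<DWORD>(w.size()),
                             &written, NULL) && written == w.size();
    }
#endif
    // Everywhere else, UTF-8 is written exactly like ASCII bytes.
    return std::fwrite(s.data(), 1, s.size(), stdout) == s.size();
}
```

A call such as print_utf8("caf\xC3\xA9\n") then takes the console path on Windows and the plain byte path elsewhere.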
Should I use the environment too? I've read about LC_CTYPE for example, should I care about it as a developer?
LC_CTYPE is a holdover from the days when every language had its own character encoding, and thus its own ctype.h functions. Today, the Unicode Character Database takes care of that. The beauty of Unicode is that it separates character encoding handling from locale handling (except for the special uppercase/lowercase rules for Lithuanian, Turkish, and Azeri).
But each language still has its own collation rules and number formatting rules, so you'll still need locales for those. And you'll need to set your locale's character encoding to UTF-8.
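To illustrate why locales still matter for number formatting, here is a small sketch using the standard C++ locale facilities. format_number is a hypothetical helper, and which locale names are available is an assumption that depends entirely on what is installed on the system.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: formats `value` according to the rules of `loc`.
std::string format_number(double value, const std::locale& loc) {
    std::ostringstream out;
    out.imbue(loc);   // the stream now uses loc's numeric conventions
    out << value;
    return out.str();
}
```

With std::locale::classic() the value 1234.5 comes out with a period as the decimal separator. Passing something like std::locale("de_DE.UTF-8") would typically give a comma instead, but that constructor throws std::runtime_error if the named locale is not installed.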
What are the more portable and clean ways to handle Unicode character sequences in C and C++?
Use a library like ICU. If you can't, and that means you abso-freaking-lutely can't, roll your own. Be prepared to have a Hard Time though. Also, do look up the sample source code in the Unicode.org documentation.
Should I use the environment too?
Yes. You will probably need to use the std::setlocale function as well. This allows you to set a locale corresponding to the encoding you want, e.g. if you want to use British English as a language and UTF-8 as encoding you'd set LC_CTYPE to en_GB.UTF-8.
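A minimal sketch of that call follows. select_ctype_locale is a hypothetical wrapper; the only thing it relies on is that std::setlocale returns a null pointer when the requested locale cannot be selected.

```cpp
#include <clocale>

// Hypothetical wrapper: tries to select `name` for the LC_CTYPE category only.
// Returns false if the C library could not honour the request.
bool select_ctype_locale(const char* name) {
    return std::setlocale(LC_CTYPE, name) != nullptr;
}
```

A program might try select_ctype_locale("en_GB.UTF-8") first and fall back to select_ctype_locale("") to pick up whatever the user's environment specifies, since a particular named locale is not guaranteed to be installed.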
C++03 does not give you a way to deal with Unicode. Your best bet is to use the wchar_t data type (and by extension std::wstring). However, note that its size and character encoding differ between operating systems. E.g. Windows uses 2 bytes for wchar_t and UTF-16 encoding, whereas GNU/Linux and Mac OS X use 4 bytes and UTF-32.
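The difference is easy to observe with a character outside the Basic Multilingual Plane. The sketch below (wide_units_for_emoji is a hypothetical name) counts how many wchar_t units the compiler used for U+1F600.

```cpp
#include <cstddef>
#include <cwchar>

// Hypothetical helper: number of wchar_t units in a literal holding U+1F600.
// With a 2-byte wchar_t the compiler emits a UTF-16 surrogate pair (2 units);
// with a 4-byte wchar_t it emits a single UTF-32 unit.
std::size_t wide_units_for_emoji() {
    return std::wcslen(L"\U0001F600");
}
```

Code that assumes one wchar_t per character therefore behaves differently across platforms, which is the portability problem described above.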
C++0x is supposed to amend the situation by allowing Unicode literals, codecvt facets, C Unicode TR support (read <uchar.h>), etc., but that's a long way off for most compilers. (There are a few questions here on SO that ought to help you get started.)
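For a sense of what those Unicode literals look like, here is a short sketch. It assumes a compiler that implements the finalized standard (C++11 or later); the function names are hypothetical and exist only to expose the sizes.

```cpp
#include <cstddef>
#include <string>

// U+20AC (euro sign) as a UTF-8 literal: bytes excluding the terminating NUL.
std::size_t euro_utf8_bytes() { return sizeof(u8"\u20AC") - 1; }

// U+1F600 as a UTF-16 literal: stored as a surrogate pair of char16_t units.
std::size_t emoji_utf16_units() { return std::u16string(u"\U0001F600").size(); }

// U+1F600 as a UTF-32 literal: a single char32_t unit.
std::size_t emoji_utf32_units() { return std::u32string(U"\U0001F600").size(); }
```

Unlike wchar_t, the char16_t and char32_t types have the same width on every platform, so these counts do not depend on the operating system.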