During my study of character encoding in C and C++ I came across two general ways of encoding: multibyte characters and wide characters. To strengthen my understanding of those systems (benefits and drawbacks) I wanted to work through some examples. Doing examples with wide characters is not a problem thanks to the native support provided by the wchar_t type. But when I wanted to create a string that contains those so-called multibyte characters, I ran into a problem.
How can I actually create a multibyte character string which uses an encoding that works with a char array (using Visual C++)? This kind of encoding sure does exist: http://www.gnu.org/software/libc/manual/html_node/Shift-State.html. But I read only about it and never saw an actual example. Or do you have to create your own encoding for this kind of string?
If supported by your input device, multibyte characters can be entered directly. Otherwise, you can enter any multibyte character in the ASCII form \[N], where N is the 2-, 4-, 6-, 7-, or 8-digit hexadecimal encoding for the character.
The term “multibyte character” is defined by ISO C to denote a byte sequence that encodes an ideogram, no matter what encoding scheme is employed. All multibyte characters are members of the “extended character set.” A regular single-byte character is just a special case of a multibyte character.
Examples of multibyte character sets are the IBM-eucJP and the IBM-943 code sets. The single-byte code sets have at most 256 characters and the multibyte code sets have more than 256 (without any theoretical limit).
UTF-8 is a multibyte encoding able to encode the whole Unicode character set. An encoded character takes between 1 and 4 bytes. The original UTF-8 design allowed longer byte sequences, up to 6 bytes, but the largest Unicode code point (U+10FFFF) only takes 4 bytes, and RFC 3629 limits UTF-8 to 4 bytes accordingly.
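To make the 1-to-4-byte layout concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The name encode_utf8 is my own illustrative helper, not a standard library function, and the sketch assumes the caller passes a Unicode scalar value as a 32-bit integer.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence (1 to 4 bytes).
// encode_utf8 is a hypothetical helper written for this example.
std::string encode_utf8(std::uint32_t cp) {
    // Surrogates and values above U+10FFFF are not valid scalar values.
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
        throw std::invalid_argument("not a Unicode scalar value");
    }
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The result is an ordinary std::string, that is, a char sequence in which one character may occupy several bytes, which is exactly what "multibyte" means here.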
If you are able to create a wide character string literal, simply omitting the L should give you a multibyte character string literal with an implementation-defined encoding (gcc has an option to choose it, -fexec-charset; I don't know about Visual C++).
If you have a wide character string, you can get the equivalent multibyte string according to the C locale using the functions wcstombs (in <stdlib.h>) and wcsrtombs (in <wchar.h>).
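As a rough sketch of the wcstombs route, the helper below converts a std::wstring to a multibyte std::string using whatever C locale is currently active. The name to_multibyte is made up for this example; the buffer is sized with MB_CUR_MAX so it does not rely on passing a null destination, which ISO C does not guarantee.

```cpp
#include <clocale>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Convert a wide string to a multibyte string in the current C locale.
// to_multibyte is a hypothetical helper written for this example.
std::string to_multibyte(const std::wstring& wide) {
    // Worst case: every wide character needs MB_CUR_MAX bytes, plus the terminator.
    std::string buf(wide.size() * MB_CUR_MAX + 1, '\0');
    std::size_t n = std::wcstombs(&buf[0], wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) {
        // A wide character has no representation in the locale's encoding.
        throw std::runtime_error("character not representable in the current locale");
    }
    buf.resize(n);  // n excludes the terminating null byte
    return buf;
}
```

In the default "C" locale only basic characters are guaranteed to convert, so a real program would normally call std::setlocale(LC_ALL, "") (or name a specific locale) before using it. Visual C++ may emit a deprecation warning for wcstombs and suggest wcstombs_s instead.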
The C++ locale system also provides a way to do that conversion. (Look for the in and out members of the codecvt facet. I won't provide a tutorial on their use here; the site cppreference has example code, for instance for out.)
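For orientation only, this is roughly what a call to the out member looks like. It is a sketch, not a tutorial: narrow_with_codecvt is a name I made up, the locale is supplied by the caller, and anything other than a clean conversion is treated as an error.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Convert a wide string to the narrow encoding of `loc` via codecvt::out.
// narrow_with_codecvt is a hypothetical helper written for this example.
std::string narrow_with_codecvt(const std::wstring& ws, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_t;
    const facet_t& cvt = std::use_facet<facet_t>(loc);

    std::mbstate_t state = std::mbstate_t();  // initial conversion (shift) state
    // max_length() is the most bytes one wide character can need in this locale.
    std::string out(ws.size() * cvt.max_length() + 1, '\0');

    const wchar_t* from_next = 0;
    char* to_next = 0;
    std::codecvt_base::result r = cvt.out(
        state,
        ws.data(), ws.data() + ws.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r != std::codecvt_base::ok) {
        throw std::runtime_error("conversion failed or was incomplete");
    }
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Passing std::locale::classic() only handles basic characters; passing std::locale("") uses the user's environment, and which encodings are available then depends on the platform.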
I'm not sure you'll easily find support, either on Unix or on Windows, for an encoding with a shift state. You should search for encodings for China, Japan, Korea, Vietnam (for instance ISO 2022-JP, but it seems to me that Unix tends to use EUC-JP instead and Windows Shift JIS).
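Even without library support, a shift-state string can be written by hand into a char array using hex escapes, which keeps the bytes independent of the compiler's source and execution character sets. The sketch below spells out the two characters of "Nihon" (Japan) in ISO 2022-JP and counts characters by tracking the shift state. The names kNihonIso2022Jp and count_iso2022jp_chars are mine, and the counter only recognizes the four escape sequences listed in its comment.

```cpp
#include <cstddef>
#include <string>

// "Nihon" in ISO 2022-JP: ESC $ B switches to two-byte JIS X 0208,
// then 0x46 0x7C and 0x4B 0x5C are the two kanji, then ESC ( B returns to ASCII.
const char kNihonIso2022Jp[] = "\x1B$B\x46\x7C\x4B\x5C\x1B(B";

// Count the characters in an ISO 2022-JP byte string (hypothetical helper).
// Handled escapes: ESC ( B, ESC ( J (one byte per character)
// and ESC $ @, ESC $ B (two bytes per character).
std::size_t count_iso2022jp_chars(const std::string& s) {
    bool two_byte = false;  // the shift state; strings start in ASCII
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        if (s[i] == '\x1B' && i + 2 < s.size()) {
            if (s[i + 1] == '$' && (s[i + 2] == '@' || s[i + 2] == 'B')) {
                two_byte = true;   // shift into the two-byte set
                i += 3;
                continue;
            }
            if (s[i + 1] == '(' && (s[i + 2] == 'B' || s[i + 2] == 'J')) {
                two_byte = false;  // shift back to a one-byte set
                i += 3;
                continue;
            }
        }
        i += two_byte ? 2 : 1;     // same bytes, different meaning per state
        ++count;
    }
    return count;
}
```

The array holds 10 bytes plus the terminator but only 2 characters, and the bytes 0x46 0x7C would read as the ASCII text "F|" if the preceding escape sequence were ignored. That dependence on earlier bytes is what makes the encoding stateful.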