
How to create multibyte characters in C

During my study of character encoding in C and C++ I came across two general ways of encoding: multibyte characters and wide characters. To strengthen my understanding of both systems (their benefits and drawbacks) I wanted to work through some examples. Examples with wide characters are not a problem thanks to the native support through the wchar_t type. But when I tried to create a string containing so-called multibyte characters, I ran into a problem.

How can I actually create a multibyte character string which uses an encoding that works with a char array (using Visual C++)? This kind of encoding certainly exists: http://www.gnu.org/software/libc/manual/html_node/Shift-State.html. But I have only read about it and have never seen an actual example. Or do you have to create your own encoding for this kind of string?

asked Sep 04 '14 13:09 by Sam



1 Answer

If you are able to create a wide character string literal, simply omitting the L prefix should give you a multibyte character string literal with an implementation-defined encoding (gcc has an option to choose it; I don't know about Visual C++).

If you have a wide character string, you can get the equivalent multibyte string according to the current C locale (the LC_CTYPE category, as set with setlocale) using the functions wcstombs (in <stdlib.h>) and wcsrtombs (in <wchar.h>).

The C++ locale system also provides a way to do that conversion. Look at the in and out members of the codecvt facet; I won't provide a tutorial on their use here, but cppreference has example code, for instance for out.

I'm not sure you'll easily find support, on either Unix or Windows, for an encoding with a shift state. You should search for the encodings used in China, Japan, Korea, and Vietnam (for instance ISO 2022-JP, though it seems to me that Unix tends to use EUC-JP instead, and Windows Shift JIS).

answered Oct 17 '22 00:10 by AProgrammer