Please clarify for me, how does UTF16 work? I am a little confused, considering these points:
So if a UTF-16 character is not always 2 bytes long, how long else could it be? 3 bytes? Or only multiples of 2? And then, for example, if there is a WinAPI function that wants to know the size of a wide string in characters, and the string contains 2 characters which are each 4 bytes long, how is the size of that string in characters calculated?
Is it 2 chars long or 4 chars long? (since it is 8 bytes long, and each WCHAR is 2 bytes)
UPDATE: Now I see that character counting is not necessarily a standard thing or even a C++ thing, so I'll try to be a little more specific in my second question, about the length in "characters" of a wide string:
On Windows, specifically, in WinAPI, in their wide functions (ending with W), how does one count the number of characters in a string that consists of 2 Unicode codepoints, each consisting of 2 codeunits (total of 8 bytes)? Is such a string 2 characters long (the same as the number of codepoints) or 4 characters long (the same as the total number of codeunits)?
Or, being more generic: what does the Windows definition of "number of characters in a wide string" mean, number of codepoints or number of codeunits?
UTF-16 is based on 16-bit code units, so each character is encoded as at least 2 bytes. Some characters that are encoded with a single 1-byte code unit in UTF-8 are encoded with one 2-byte code unit in UTF-16. Supplementary characters (code points above U+FFFF) are encoded as a surrogate pair of two code units, i.e. 4 bytes, and thus require additional storage. With supplementary characters, UTF-16 can represent more than one million characters; without them, only 65,536 could be represented. For example, .NET uses UTF-16 to encode the text in a string, and a char instance represents a single 16-bit code unit. In short, each character is either 16 bits (2 bytes) or 32 bits (4 bytes).
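To make the 2-byte vs 4-byte cases concrete, here is a minimal sketch using C++11 UTF-16 string literals (char16_t); the specific code points are chosen only for illustration:

```cpp
#include <cstdio>

int main() {
    // u"..." literals are UTF-16 (char16_t) in C++11 and later.
    const char16_t a[]      = u"A";          // U+0041: one 16-bit code unit
    const char16_t eacute[] = u"\u00E9";     // U+00E9 (é): still one code unit
    const char16_t smiley[] = u"\U0001F600"; // U+1F600: outside the BMP, a surrogate pair of two code units

    // sizeof includes the terminating u'\0', so subtract one code unit.
    std::printf("U+0041 : %zu code unit(s), %zu bytes\n",
                sizeof(a) / sizeof(char16_t) - 1, sizeof(a) - sizeof(char16_t));
    std::printf("U+00E9 : %zu code unit(s), %zu bytes\n",
                sizeof(eacute) / sizeof(char16_t) - 1, sizeof(eacute) - sizeof(char16_t));
    std::printf("U+1F600: %zu code unit(s), %zu bytes\n",
                sizeof(smiley) / sizeof(char16_t) - 1, sizeof(smiley) - sizeof(char16_t));
    // Prints 1/2, 1/2 and 2/4: every code point takes either 2 or 4 bytes in UTF-16,
    // never 3 and never anything that is not a multiple of 2.
}
```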
Short answer: No.
The size of a wchar_t (the basic character unit) is not defined by the C++ Standard (see section 3.9.1 paragraph 5). In practice, on Windows platforms it is two bytes long, and on Linux/Mac platforms it is four bytes long.
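You can check the size on your own toolchain with a one-liner; a minimal sketch:

```cpp
#include <cstdio>

int main() {
    // Typically prints 2 with MSVC on Windows and 4 with GCC/Clang on Linux or macOS;
    // the C++ Standard itself does not fix the value.
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
}
```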
In addition, the characters are stored in an endian-specific format. On Windows this usually means little-endian, but it's also valid for a wchar_t to contain big-endian data.
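If you want to see the byte order for yourself, you can inspect the raw bytes of a wchar_t; the output noted in the comments assumes a little-endian build with a 2-byte wchar_t:

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
    wchar_t ch = L'\u00E9';              // é, code point 0x00E9
    unsigned char bytes[sizeof(wchar_t)];
    std::memcpy(bytes, &ch, sizeof ch);  // copy out the in-memory representation

    // On a little-endian build with a 2-byte wchar_t this prints "e9 00":
    // the low-order byte is stored first.
    for (std::size_t i = 0; i < sizeof ch; ++i)
        std::printf("%02x ", bytes[i]);
    std::printf("\n");
}
```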
Furthermore, even though each wchar_t is two (or four) bytes long, an individual glyph (roughly, a character) could require multiple wchar_ts, and there may be more than one way to represent it.
A common example is the character é (LATIN SMALL LETTER E WITH ACUTE), code point 0x00E9. This can also be represented as the "decomposed" code point sequence 0x0065 0x0301 (which is LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT). Both are valid; see the Wikipedia article on Unicode equivalence for more information.
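A small sketch makes the equivalence concrete; the two wide strings below render the same glyph but differ in length and are not byte-for-byte equal:

```cpp
#include <cstdio>
#include <cwchar>

int main() {
    const wchar_t composed[]   = L"\u00E9";  // LATIN SMALL LETTER E WITH ACUTE
    const wchar_t decomposed[] = L"e\u0301"; // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

    std::printf("composed  : %zu wchar_t(s)\n", std::wcslen(composed));   // 1
    std::printf("decomposed: %zu wchar_t(s)\n", std::wcslen(decomposed)); // 2
    std::printf("equal? %s\n", std::wcscmp(composed, decomposed) == 0 ? "yes" : "no"); // no
}
```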
Simply, you need to know or pick the encoding that you will be using. If dealing with Windows APIs, an easy choice is to assume everything is little-endian UTF-16 stored in 2-byte wchar_ts.
On Linux/Mac, UTF-8 (with chars) is more common and APIs usually take UTF-8. wchar_t is seen as wasteful there because it uses 4 bytes per character.
For cross-platform programming, therefore, you may wish to work with UTF-8 internally and convert to UTF-16 on the fly when calling Windows APIs. Windows provides the MultiByteToWideChar and WideCharToMultiByte functions to do this, and you can also find wrappers that simplify using these functions, such as the ATL and MFC String Conversion Macros.
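As a sketch of that pattern (Utf8ToWide is a hypothetical helper name, and error handling is kept minimal), the usual idiom is to call MultiByteToWideChar twice: once to get the required length, and once to convert:

```cpp
#include <windows.h>
#include <string>

// Hypothetical helper: convert a UTF-8 std::string to a UTF-16 std::wstring.
std::wstring Utf8ToWide(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();

    // First call: ask how many wchar_t code units the result will need.
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), static_cast<int>(utf8.size()),
                                  nullptr, 0);
    if (len == 0) return std::wstring();  // invalid UTF-8 or other failure

    // Second call: perform the conversion into the allocated buffer.
    std::wstring wide(static_cast<size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &wide[0], len);
    return wide;
}

int main()
{
    std::string utf8 = "caf\xC3\xA9";      // "café" kept internally as UTF-8 bytes
    std::wstring wide = Utf8ToWide(utf8);  // convert only at the API boundary
    MessageBoxW(nullptr, wide.c_str(), L"UTF-16 text", MB_OK);
}
```

WideCharToMultiByte follows the same two-call pattern in the other direction.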
The question has been updated to ask what Windows APIs mean when they ask for the “number of characters” in a string.
If the API says "size of the string in characters", they are referring to the number of wchar_ts (or the number of chars if you are compiling in non-Unicode mode for some reason). In that specific case you can ignore the fact that a Unicode character may take more than one wchar_t. Those APIs are just looking to fill a buffer and need to know how much room they have.
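For the example from the question (two code points, each encoded as a surrogate pair), here is a sketch of what that buffer-oriented counting looks like, assuming a Windows build where wchar_t is 16 bits:

```cpp
#include <windows.h>
#include <cstdio>
#include <cwchar>

int main()
{
    // Two code points outside the BMP, so each needs a surrogate pair:
    // 2 code points -> 4 wchar_t code units -> 8 bytes of character data.
    const wchar_t text[] = L"\U0001F600\U0001F601";

    std::printf("wcslen   : %zu\n", std::wcslen(text));                   // 4
    std::printf("lstrlenW : %d\n",  lstrlenW(text));                      // 4
    std::printf("bytes    : %zu\n", std::wcslen(text) * sizeof(wchar_t)); // 8
    // An API asking for the length "in characters" wants 4 here
    // (or 5 if it asks for the buffer size including the null terminator).
}
```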
You seem to have several misconceptions.
"There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long, obviously)"
This is wrong. Assuming you refer to the C++ type wchar_t: it is not always 2 bytes long; 4 bytes is also a common value, and there's no restriction that it can only be one of those two values. If you don't mean that type, it isn't part of C++ but is some platform-specific type.
"There are no 'extra wide' functions or character types widely used in C++ or Windows, so I would assume that UTF16 is all that is ever needed."
"UTF16 seems to be a bigger version of UTF8, and UTF8 characters can be of different lengths."
UTF-8 and UTF-16 are different encodings for the same character set, so UTF-16 is not "bigger". Technically, the scheme used in UTF-8 could encode more characters than the scheme used in UTF-16, but as UTF-8 and UTF-16 they encode the same set.
Don't use the term "character" lightly when it comes to Unicode. A code unit in UTF-16 is 2 bytes wide, and a code point is represented by 1 or 2 code units. What humans usually understand as "characters" is different again and can be composed of one or more code points; if you as a programmer confuse code points with characters, bad things can happen, like http://ideone.com/qV2il
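To see the three notions of "length" diverge, here is a sketch; CountCodePoints is just an illustrative helper that assumes well-formed UTF-16 input:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Count code points in a UTF-16 string by treating each surrogate pair as one.
std::size_t CountCodePoints(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF)  // high surrogate: skip the low surrogate that follows
            ++i;
        ++count;
    }
    return count;
}

int main()
{
    // One emoji (a surrogate pair) followed by a decomposed "é" (e + combining acute).
    std::u16string s = u"\U0001F600e\u0301";

    std::printf("code units : %zu\n", s.size());            // 4
    std::printf("code points: %zu\n", CountCodePoints(s));  // 3
    // A human would probably call this 2 "characters" (one emoji, one é):
    // grapheme clusters are yet another, higher-level notion.
}
```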