
How does Microsoft handle the fact that UTF-16 is a variable-length encoding in their C++ standard library implementation?

Tags: c++, utf-16

Having a variable-length encoding is indirectly forbidden by the standard.

So I have several questions:

How is the following part of the standard handled?

17.3.2.1.3.3 Wide-character sequences

A wide-character sequence is an array object (8.3.4) A that can be declared as T A[N], where T is type wchar_t (3.9.1), optionally qualified by any combination of const or volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A character sequence can be designated by a pointer value S that designates its first element.

The length of an NTWCS is the number of elements that precede the terminating null wide character. An empty NTWCS has a length of zero.

Questions:

basic_string<wchar_t>

  • How is operator[] implemented and what does it return?
    • standard: If pos < size(), returns data()[pos]. Otherwise, if pos == size(), the const version returns charT(). Otherwise, the behavior is undefined.
  • Does size() return the number of elements or the length of the string?
    • standard: Returns: a count of the number of char-like objects currently in the string.
  • How does resize() work?
    • unrelated to standard, just what does it do
  • How are the positions in insert(), erase() and others handled?

cwctype

  • Pretty much everything in here. How is the variable encoding handled?

cwchar

  • getwchar() obviously can't return a whole platform-character, so how does this work?

Plus all the rest of the character functions (the theme is the same).

Edit: I will be opening a bounty to get some confirmation. I want to get some clear answers or at least a clearer distribution of votes.

Edit: This is starting to get pointless. This is full of totally conflicting answers. Some of you talk about external encodings (I don't care about those, UTF-8 encoded will still be stored as UTF-16 once read into the string, the same for output), the rest simply contradicts each other. :-/

Šimon Tóth asked Oct 26 '10 15:10


1 Answer

Here's how Microsoft's STL implementation handles the variable-length encoding:

basic_string<wchar_t>::operator[]() can return a low or a high surrogate, in isolation.

basic_string<wchar_t>::size() returns the number of wchar_t objects. A surrogate pair (one Unicode character) uses two wchar_t's and therefore adds two to the size.

basic_string<wchar_t>::resize() can truncate a string in the middle of a surrogate pair.

basic_string<wchar_t>::insert() can insert in the middle of a surrogate pair.

basic_string<wchar_t>::erase() can erase either half of a surrogate pair.

In general, the pattern should be clear: the STL does not assume that a std::wstring is in UTF-16, nor enforce that it remains UTF-16.
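The behavior listed above can be sketched in portable C++. This is an illustration under stated assumptions, not Microsoft's code: std::u16string stands in for std::wstring, because wchar_t is 16 bits wide on Windows but usually 32 bits elsewhere, and the helper names (is_high_surrogate, is_low_surrogate, count_code_points, demo) are hypothetical. It shows that size(), operator[] and resize() work on 16-bit code units, and that counting Unicode code points has to be done by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: std::u16string models std::wstring on Windows (16-bit wchar_t).
// All helper names below are hypothetical, chosen for this sketch.

inline bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_low_surrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Counts code points: a well-formed surrogate pair counts once,
// and an unpaired surrogate also counts once.
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            ++i;  // skip the low half of the pair
        }
        ++n;
    }
    return n;
}

inline void demo() {
    // U+1D11E (MUSICAL SYMBOL G CLEF) followed by 'a'.
    std::u16string s = u"\U0001D11Ea";

    assert(s.size() == 3);               // code units, not characters
    assert(is_high_surrogate(s[0]));     // operator[] yields one half of the pair
    assert(is_low_surrogate(s[1]));
    assert(count_code_points(s) == 2);   // the caller has to decode pairs

    s.resize(1);                         // cuts the pair in half without complaint
    assert(s.size() == 1 && is_high_surrogate(s[0]));
}
```

Calling demo() passes every assertion: the string class never inspects its contents, so any code that needs whole characters has to apply surrogate-aware logic such as count_code_points itself.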

MSalters answered Sep 28 '22 01:09
