Having a variable length encoding is indirectly forbidden in the standard.
So I have several questions:
How is the following part of the standard handled?
17.3.2.1.3.3 Wide-character sequences
A wide-character sequence is an array object (8.3.4) A that can be declared as T A[N], where T is type wchar_t (3.9.1), optionally qualified by any combination of const or volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A character sequence can be designated by a pointer value S that designates its first element.
The length of an NTWCS is the number of elements that precede the terminating null wide character. An empty NTWCS has a length of zero.
Questions:
basic_string<wchar_t>
How is operator[] implemented and what does it return?
Standard: If pos < size(), returns data()[pos]. Otherwise, if pos == size(), the const version returns charT(). Otherwise, the behavior is undefined.
Does size() return the number of elements or the length of the string?
Standard: Returns: a count of the number of char-like objects currently in the string.
How does resize() work?
How are insert(), erase() and others handled?
cwctype
cwchar
getwchar() obviously can't return a whole platform-character, so how does this work?
Plus all the rest of the character functions (the theme is the same).
Edit: I will be opening a bounty to get some confirmation. I want to get some clear answers or at least a clearer distribution of votes.
Edit: This is starting to get pointless. This is full of totally conflicting answers. Some of you talk about external encodings (I don't care about those, UTF-8 encoded will still be stored as UTF-16 once read into the string, the same for output), the rest simply contradicts each other. :-/
UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units.
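To make the "one or two 16-bit code units" rule concrete, here is a minimal sketch of the encoding step. The name encode_utf16 is a hypothetical helper, not a standard-library or Windows function, and it uses char16_t and std::u16string (rather than wchar_t) so that code units are 16 bits wide on every platform.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16.
// Code points below U+10000 take one code unit; the rest take a surrogate pair.
std::u16string encode_utf16(char32_t cp) {
    // Only Unicode scalar values are accepted: no surrogates, nothing above U+10FFFF.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;  // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, U+0041 ('A') comes back as a single code unit, while U+1F600 comes back as the pair 0xD83D, 0xDE00.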
Windows uses UTF-16. Previously, it used UCS-2. Support for UTF-16 was added in Windows 2000. UTF-16 is a variable width 2-byte or 4-byte character encoding for Unicode.
UTF-16 is, obviously, more efficient for A) characters for which UTF-16 requires fewer bytes to encode than does UTF-8. UTF-8 is, obviously, more efficient for B) characters for which UTF-8 requires fewer bytes to encode than does UTF-16.
UTF-16 (16-bit Unicode Transformation Format) is a standard method of encoding Unicode character data. Part of the Unicode Standard version 3.0 (and higher-numbered versions), UTF-16 has the capacity to encode all currently defined Unicode characters.
Here's how Microsoft's STL implementation handles the variable-length encoding:
basic_string<wchar_t>::operator[]() can return a low or a high surrogate, in isolation.
basic_string<wchar_t>::size() returns the number of wchar_t objects. A surrogate pair (one Unicode character) uses two wchar_t's and therefore adds two to the size.
basic_string<wchar_t>::resize() can truncate a string in the middle of a surrogate pair.
basic_string<wchar_t>::insert() can insert in the middle of a surrogate pair.
basic_string<wchar_t>::erase() can erase either half of a surrogate pair.
In general, the pattern should be clear: the STL does not assume that a std::wstring is in UTF-16, nor enforce that it remains UTF-16.
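The behavior described above can be sketched in a few lines. This is an illustration under stated assumptions, not Microsoft's actual code: it uses std::u16string as a portable stand-in for a std::wstring with 16-bit wchar_t (as on Windows), and count_code_points and truncated_example are hypothetical helpers written for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points by treating each well-formed
// surrogate pair as one. basic_string itself offers nothing like this.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool low_next =
            i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && low_next) ++i;  // consume both code units of the pair
        ++n;                        // a lone surrogate still counts as one
    }
    return n;
}

// Hypothetical demo: 'a' followed by U+1F600, then cut the pair with resize().
std::u16string truncated_example() {
    std::u16string s = u"a\U0001F600";  // code units: 'a', 0xD83D, 0xDE00
    assert(s.size() == 3);              // size() counts code units, not characters
    s.resize(2);                        // drops only the low surrogate
    return s;
}
```

With s = u"a\U0001F600", s.size() is 3 and s[1] is the isolated high surrogate 0xD83D, while count_code_points(s) is 2. The string returned by truncated_example() ends in a lone high surrogate, which is no longer well-formed UTF-16, and the string class does not object.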