How does Microsoft handle the fact that UTF-16 is a variable length encoding in their C++ standard library implementation

Tags: c++, utf-16

Having a variable length encoding is indirectly forbidden in the standard.

So I have several questions:

How is the following part of the standard handled?

17.3.2.1.3.3 Wide-character sequences

A wide-character sequence is an array object (8.3.4) A that can be declared as T A[N], where T is type wchar_t (3.9.1), optionally qualified by any combination of const or volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A character sequence can be designated by a pointer value S that designates its first element.

The length of an NTWCS is the number of elements that precede the terminating null wide character. An empty NTWCS has a length of zero.

Questions:

basic_string<wchar_t>

How is operator[] implemented and what does it return?
- standard: If pos < size(), returns data()[pos]. Otherwise, if pos == size(), the const version returns charT(). Otherwise, the behavior is undefined.
Does size() return the number of elements or the length of the string?
- standard: Returns: a count of the number of char-like objects currently in the string.
How does resize() work?
- unrelated to standard, just what does it do
How are the positions in insert(), erase() and others handled?

cwctype

Pretty much everything in here. How is the variable encoding handled?

cwchar

getwchar() obviously can't return a whole platform-character, so how does this work?

Plus all the rest of the character functions (the theme is the same).

Edit: I will be opening a bounty to get some confirmation. I want to get some clear answers or at least a clearer distribution of votes.

Edit: This is starting to get pointless. This is full of totally conflicting answers. Some of you talk about external encodings (I don't care about those; UTF-8 encoded text will still be stored as UTF-16 once read into the string, and the same goes for output), and the rest simply contradict each other. :-/


asked Oct 26 '10 15:10

Šimon Tóth

1 Answer

Here's how Microsoft's STL implementation handles the variable-length encoding:

basic_string<wchar_t>::operator[]() can return a low or a high surrogate, in isolation.

basic_string<wchar_t>::size() returns the number of wchar_t objects. A surrogate pair (one Unicode character) uses two wchar_t's and therefore adds two to the size.
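A minimal sketch of the two points above, assuming the string holds UTF-16 code units. The helper names (make_sample, is_high_surrogate, is_low_surrogate, demo_size_and_index) are made up for illustration, and the surrogate pair for U+1F600 is written out as explicit code units so that the example does not depend on the width of wchar_t:

```cpp
#include <cassert>
#include <string>

// "a" followed by U+1F600, spelled as explicit UTF-16 code units
// (high surrogate 0xD83D, low surrogate 0xDE00).
std::wstring make_sample() {
    return std::wstring{L'a',
                        static_cast<wchar_t>(0xD83D),
                        static_cast<wchar_t>(0xDE00)};
}

// Hypothetical helpers: classify a single code unit.
bool is_high_surrogate(wchar_t c) { return c >= 0xD800 && c <= 0xDBFF; }
bool is_low_surrogate(wchar_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

void demo_size_and_index() {
    const std::wstring s = make_sample();
    assert(s.size() == 3);            // two code points, but three wchar_t elements
    assert(is_high_surrogate(s[1]));  // operator[] hands back one half of the pair
    assert(is_low_surrogate(s[2]));   // and the other half at the next index
}
```

Calling demo_size_and_index() should pass on any conforming implementation, since none of these members inspect what the elements mean.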

basic_string<wchar_t>::resize() can truncate a string in the middle of a surrogate pair.

basic_string<wchar_t>::insert() can insert in the middle of a surrogate pair.

basic_string<wchar_t>::erase() can erase either half of a surrogate pair.
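The three mutating operations above can be illustrated the same way. This is only a sketch: pair_string, truncated, split_by_insert and half_erased are hypothetical names, and the string is again built from explicit UTF-16 code units so the result does not depend on the platform's wchar_t width:

```cpp
#include <string>

// U+1F600 as explicit UTF-16 code units: high surrogate, then low surrogate.
std::wstring pair_string() {
    return std::wstring{static_cast<wchar_t>(0xD83D),
                        static_cast<wchar_t>(0xDE00)};
}

// resize() cuts after the high surrogate and leaves it unpaired.
std::wstring truncated() {
    std::wstring s = pair_string();
    s.resize(1);
    return s;
}

// insert() places a character between the two halves of the pair.
std::wstring split_by_insert() {
    std::wstring s = pair_string();
    s.insert(1, L"x");
    return s;
}

// erase() removes only the high surrogate and leaves the low one behind.
std::wstring half_erased() {
    std::wstring s = pair_string();
    s.erase(0, 1);
    return s;
}
```

All three results are valid wstring objects as far as the library is concerned, but none of them is well-formed UTF-16 any more.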

In general, the pattern should be clear: the STL does not assume that a std::wstring is in UTF-16, nor enforce that it remains UTF-16.
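Since the library does not interpret the contents, any UTF-16 awareness has to live in user code. As a rough sketch, a code-point count could look like the following. count_code_points is a hypothetical helper, not a standard library function, and it assumes the wstring is meant to hold UTF-16; unpaired surrogates are simply counted as one code point each rather than rejected:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a wstring assumed to hold UTF-16.
// A high surrogate followed by a low surrogate counts as one code point;
// every other code unit (including an unpaired surrogate) counts as one.
std::size_t count_code_points(const std::wstring& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool next_low = i + 1 < s.size() &&
                              s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && next_low) {
            ++i;  // skip the low half of a well-formed pair
        }
        ++n;
    }
    return n;
}
```

For a string holding "a" plus one surrogate pair, size() reports 3 while this helper reports 2, which is exactly the gap between "number of wchar_t objects" and "number of characters" that the question is about.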

answered Sep 28 '22 01:09

MSalters
