 

Is size of char_traits<char16_t>::int_type not large enough?

Tags:

c++

Consider the following program:

#include <iostream>
#include <sstream>
#include <string>

int main(int, char **) {
  std::basic_stringstream<char16_t> stream;

  stream.put(u'\u0100');
  std::cout << " Bad: " << stream.bad() << std::endl;

  stream.put(u'\uFFFE');
  std::cout << " Bad: " << stream.bad() << std::endl;

  stream.put(u'\uFFFF');
  std::cout << " Bad: " << stream.bad() << std::endl;

  return 0;
}

The output is:

 Bad: 0
 Bad: 0
 Bad: 1

It seems the badbit gets set because put() sets it when the character compares equal to std::char_traits<char16_t>::eof(). Once that happens, I can no longer put anything to the stream.

At http://en.cppreference.com/w/cpp/string/char_traits it states:

int_type: an integer type that can hold all values of char_type plus EOF

But if char_type is the same size as int_type (uint_least16_t), how can this be true?
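
A small probe like the following can be used to inspect the types and the value of eof(); the output is implementation-specific, so treat it only as a way to check your own setup:

#include <iostream>
#include <string>

int main() {
  using traits = std::char_traits<char16_t>;

  // int_type is specified to be std::uint_least16_t, so on common
  // platforms it has the same width as char16_t itself.
  std::cout << "sizeof(char16_t):  " << sizeof(char16_t) << '\n'
            << "sizeof(int_type):  " << sizeof(traits::int_type) << '\n'
            << "eof():             " << traits::eof() << '\n';

  // If eof() is 0xFFFF, then to_int_type(u'\uFFFF') compares equal to
  // eof(), so put() cannot tell that character apart from a failure.
  std::cout << "u'\\uFFFF' == eof(): "
            << traits::eq_int_type(traits::to_int_type(u'\uFFFF'),
                                   traits::eof())
            << '\n';
}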

Asked May 03 '17 by lrm29

2 Answers

The standard is quite explicit: std::char_traits<char16_t>::int_type is a typedef for std::uint_least16_t; see [char.traits.specializations.char16_t], which also says:

The member eof() shall return an implementation-defined constant that cannot appear as a valid UTF-16 code unit.

I'm not sure precisely how that interacts with http://www.unicode.org/versions/corrigendum9.html, but existing practice in the major C++ implementations is to use the all-ones bit pattern for char_traits<char16_t>::eof(), even when uint_least16_t has exactly 16 bits.

After a bit more thought, I think it's possible for implementations to meet the Character traits requirements by making std::char_traits<char16_t>::to_int_type(char_type) return U+FFFD when given U+FFFF. This satisfies the requirement for eof() to return:

a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for all values c.

This would also ensure that it's possible to distinguish success and failure when checking the result of basic_streambuf<char16_t>::sputc(u'\uFFFF'), so that it only returns eof() on failure, and returns u'\ufffd' otherwise.
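
As a rough sketch of that idea (the traits class name below is made up, and whether the badbit stays clear depends on the library checking the value returned by sputc() against eof(), as libstdc++ does), a user-side traits class could perform the same remapping:

#include <sstream>
#include <string>

// Hypothetical traits class: to_int_type() maps U+FFFF to U+FFFD so that
// no valid char16_t value ever compares equal to the inherited eof().
struct remapping_char16_traits : std::char_traits<char16_t> {
  static int_type to_int_type(char_type c) noexcept {
    return c == u'\uFFFF' ? int_type(0xFFFD) : int_type(c);
  }
};

int main() {
  std::basic_stringstream<char16_t, remapping_char16_traits> stream;
  stream.put(u'\uFFFF');        // sputc() now returns 0xFFFD, not eof()
  return stream.bad() ? 1 : 0;  // expected to stay good
  // Caveat: depending on how the stream buffer reconstructs the character
  // from int_type (via to_char_type), the stored code unit may end up as
  // U+FFFD rather than U+FFFF; this only sketches the eof() disambiguation.
}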

I'll try that. I've created https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80624 to track this in GCC.

I've also reported an issue against the standard, so we can fix the "cannot appear as a valid UTF-16 code unit" wording, and maybe fix it some other way too.

Answered by Jonathan Wakely


The interesting part of the behavior is that:

stream.put(u'\uFFFF');

sets the badbit, while:

stream << u'\uFFFF';
char16_t c = u'\uFFFF'; stream.write( &c, 1 );

does not set the badbit.

This answer focuses only on this difference.

Let's check GCC's source code in bits/ostream.tcc, lines 164-165, where we can see that put() checks whether the value equals eof() and, if so, sets the badbit:

if (traits_type::eq_int_type(__put, traits_type::eof()))  // <== It checks the value!
    __err |= ios_base::badbit;

From line 196, we can see that write() does not have this logic; it only checks whether all the characters were written to the buffer.

This explains the behavior.
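
To make the difference easy to reproduce, here is a minimal side-by-side sketch; the results in the comments reflect the libstdc++ behavior discussed above:

#include <iostream>
#include <sstream>

int main() {
  // put() sets the badbit because sputc() returns a value equal to eof().
  std::basic_stringstream<char16_t> s1;
  s1.put(u'\uFFFF');
  std::cout << "put():      bad = " << s1.bad() << '\n';  // 1 with libstdc++

  // operator<< only checks whether the character was inserted.
  std::basic_stringstream<char16_t> s2;
  s2 << u'\uFFFF';
  std::cout << "operator<<: bad = " << s2.bad() << '\n';  // 0

  // write() likewise only checks that all characters were written.
  std::basic_stringstream<char16_t> s3;
  char16_t c = u'\uFFFF';
  s3.write(&c, 1);
  std::cout << "write():    bad = " << s3.bad() << '\n';  // 0
}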

From std::basic_ostream::put's description:

Internally, the function accesses the output sequence by first constructing a sentry object. Then (if good), it inserts c into its associated stream buffer object as if calling its member function sputc, and finally destroys the sentry object before returning.

It does not say anything about a check against eof().

So I would think this is either a bug in the documentation or a bug in the implementation.

Answered by Mine