Consider the following program:
#include <iostream>
#include <sstream>
#include <string>
int main(int, char **) {
    std::basic_stringstream<char16_t> stream;

    stream.put(u'\u0100');
    std::cout << " Bad: " << stream.bad() << std::endl;

    stream.put(u'\uFFFE');
    std::cout << " Bad: " << stream.bad() << std::endl;

    stream.put(u'\uFFFF');
    std::cout << " Bad: " << stream.bad() << std::endl;

    return 0;
}
The output is:
Bad: 0
Bad: 0
Bad: 1
It seems the badbit gets set because put() sets it when the character's int_type value compares equal to std::char_traits<char16_t>::eof(). After that I can no longer put to the stream.
At http://en.cppreference.com/w/cpp/string/char_traits it states:
int_type: an integer type that can hold all values of char_type plus EOF
But if char_type and int_type (uint_least16_t) can hold exactly the same set of values, how can this be true?
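To double-check that suspicion, here is a small probe of the traits values; the expected output in the comments is what I see with libstdc++, other implementations may differ:

#include <iostream>
#include <string>

int main() {
    using traits = std::char_traits<char16_t>;
    // With libstdc++ this prints 65535, 65535 and 1: the int_type value of
    // u'\uFFFF' is indistinguishable from eof().
    std::cout << "eof():                   " << traits::eof() << '\n';
    std::cout << "to_int_type(u'\\uFFFF'):  " << traits::to_int_type(u'\uFFFF') << '\n';
    std::cout << "compares equal to eof(): "
              << traits::eq_int_type(traits::to_int_type(u'\uFFFF'), traits::eof())
              << '\n';
    return 0;
}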
The standard is quite explicit: std::char_traits<char16_t>::int_type is a typedef for std::uint_least16_t, see [char.traits.specializations.char16_t], which also says:

The member eof() shall return an implementation-defined constant that cannot appear as a valid UTF-16 code unit.

I'm not sure precisely how that interacts with http://www.unicode.org/versions/corrigendum9.html but existing practice in the major C++ implementations is to use the all-ones bit pattern for char_traits<char16_t>::eof(), even when uint_least16_t has exactly 16 bits.
After a bit more thought, I think it's possible for implementations to meet the Character traits requirements by making std::char_traits<char16_t>::to_int_type(char_type) return U+FFFD when given U+FFFF. This satisfies the requirement for eof() to return:

a value e such that X::eq_int_type(e, X::to_int_type(c)) is false for all values c.

This would also ensure that it's possible to distinguish success and failure when checking the result of basic_streambuf<char16_t>::sputc(u'\uFFFF'), so that it only returns eof() on failure, and returns u'\ufffd' otherwise.
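As a rough user-level sketch of that idea (not the actual library change; the traits class name is invented here), a derived traits type whose to_int_type() folds U+FFFF to U+FFFD can be plugged into the stream, so sputc()'s return value never collides with eof():

#include <iostream>
#include <sstream>
#include <string>

// Hypothetical traits: identical to std::char_traits<char16_t> except that
// to_int_type() never yields the eof() value; U+FFFF is folded to U+FFFD.
struct fold_ffff_traits : std::char_traits<char16_t> {
    static int_type to_int_type(char_type c) noexcept {
        return c == u'\uFFFF' ? int_type(0xFFFD) : int_type(c);
    }
};

int main() {
    std::basic_stringstream<char16_t, fold_ffff_traits> stream;
    stream.put(u'\u0100');   // mirrors the question: the first put sets up the buffer
    stream.put(u'\uFFFF');   // sputc() now returns 0xFFFD instead of eof()
    // No spurious failure: 0xFFFD is distinguishable from the all-ones eof().
    std::cout << "Bad: " << stream.bad() << std::endl;   // prints "Bad: 0"
    return 0;
}

This only illustrates the distinguishability requirement; an in-library fix would of course adjust char_traits<char16_t> itself.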
I'll try that. I've created https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80624 to track this in GCC.
I've also reported an issue against the standard, so we can fix the "cannot appear as a valid UTF-16 code unit" wording, and maybe fix it some other way too.
The behavior is interesting:

stream.put(u'\uFFFF');

sets the badbit, while:

stream << u'\uFFFF';
char16_t c = u'\uFFFF'; stream.write( &c, 1 );

do not set badbit.
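Here is a self-contained comparison of the three calls; the results in the comments are what the libstdc++ behavior described below produces. Each stream first receives an ordinary code unit so that, as in the question, the put area already exists when u'\uFFFF' is written:

#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::basic_stringstream<char16_t> s1, s2, s3;
    s1.put(u'\u0100');
    s2.put(u'\u0100');
    s3.put(u'\u0100');

    s1.put(u'\uFFFF');                 // unformatted single-character output
    s2 << u'\uFFFF';                   // formatted character inserter
    char16_t c = u'\uFFFF';
    s3.write(&c, 1);                   // unformatted block output

    std::cout << "put:   bad=" << s1.bad() << std::endl;   // 1
    std::cout << "<<:    bad=" << s2.bad() << std::endl;   // 0
    std::cout << "write: bad=" << s3.bad() << std::endl;   // 0
    return 0;
}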
This answer focuses only on the difference between these calls.
Let's check gcc's source code. In bits/ostream.tcc, lines 164-165, we can see that put() checks whether the value returned by sputc() compares equal to eof() and, if so, sets the badbit:
if (traits_type::eq_int_type(__put, traits_type::eof())) // <== It checks the value!
__err |= ios_base::badbit;
From line 196, we can see that write() does not have this check; it only verifies that all the characters were written to the buffer. This explains the difference in behavior.
From std::basic_ostream::put's description:
Internally, the function accesses the output sequence by first constructing a sentry object. Then (if good), it inserts c into its associated stream buffer object as if calling its member function sputc, and finally destroys the sentry object before returning.
It says nothing about checking the result against eof(). So I would think this is either a bug in the documentation or a bug in the implementation.
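To see the ambiguity at the streambuf level, the following sketch (again relying on the libstdc++ behavior discussed above) writes u'\uFFFF' with sputc() once the put area exists: the code unit is stored, yet the return value compares equal to eof(), so put() has no way to tell success from failure:

#include <iostream>
#include <sstream>
#include <string>

int main() {
    using traits = std::char_traits<char16_t>;
    std::basic_stringbuf<char16_t> buf;
    buf.sputc(u'\u0100');                        // make sure the put area exists

    const traits::int_type r = buf.sputc(u'\uFFFF');
    // The code unit was stored, but the return value is indistinguishable
    // from the error indicator.
    std::cout << "returned eof(): " << traits::eq_int_type(r, traits::eof()) << std::endl; // 1
    std::cout << "stored units:   " << buf.str().size() << std::endl;                      // 2
    return 0;
}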