
No Unicode Streams in C++0x? Why?

Today I discovered that the C++ standards committee dropped support for Unicode streams from C++0x in the second revision. For more information, see this question.

According to this document:

The rationale for leaving out stream specializations of the two new types was that streams of non-char types have not attracted wide usage, so it is not clear that there is a real need for doubling the number of specializations of this very complicated machinery.

From this interview with Stroustrup:

Obviously, we ought to have Unicode streams and other much extended Unicode support in the standard library. The committee knew that but didn't have anyone with the skills and time to do the work, so unfortunately, this is one of the many areas where you have to look for "third party" support.

I'm not an expert in Unicode, and I'm wondering why implementing Unicode streams is so difficult. What is so problematic about it?

UmmaGumma asked Apr 14 '11 17:04


2 Answers

The first paragraph you cited tells you: it's not that Unicode streams in particular are more difficult than other streams, it's that iostreams in general are extremely complicated. Thus, implementing Unicode iostreams is difficult not because they are Unicode, but because they are iostreams.

Ben Voigt answered Nov 07 '22 11:11


The paper N2238 is from 2007 and has no relevance. I'm not sure what Stroustrup is specifically referring to in the interview, but that isn't breaking news.

N3242 §22.5 still requires codecvt_utf8 and codecvt_utf16, which are all you need for Unicode file I/O. Imbue the proper facet onto wcout and you should be good to go, assuming you have a compliant library. In practice, GCC and MSVC already supply UTF-8 conversion, and I would expect every serious C++ platform to keep parity between mbstowcs and codecvt.
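As a rough illustration of the facet approach, here is a minimal sketch that writes a wide string to a file as UTF-8. The helper names write_utf8 and read_bytes are made up for this example and are not part of any library. It assumes a standard library that still ships <codecvt>; that header was deprecated in C++17, so a compiler may emit deprecation warnings.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: writes `text` to `path` encoded as UTF-8.
// Returns true if the file was opened and written without error.
bool write_utf8(const std::string& path, const std::wstring& text) {
    std::wofstream out;
    // Imbue the conversion facet before opening the file.
    // The locale takes ownership of the facet allocated with new.
    out.imbue(std::locale(std::locale::classic(),
                          new std::codecvt_utf8<wchar_t>));
    // Binary mode: no line-ending translation on the encoded bytes.
    out.open(path, std::ios::out | std::ios::binary);
    if (!out) return false;
    out << text;
    out.close();
    return !out.fail();
}

// Hypothetical helper: reads the raw bytes back so the encoding can be inspected.
std::string read_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Calling write_utf8("out.txt", L"caf\u00e9") should leave the five bytes 63 61 66 C3 A9 in the file, since U+00E9 takes two bytes in UTF-8.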

There may be confusion because N3242 §22.5/5 says

— The multibyte sequences may be written only as a binary file. Attempting to write to a text file produces undefined behavior.

This is because text-mode I/O converts line endings, so a 0x0A byte that is half of a 16-bit UTF-16 code unit could be converted to 0x0D, 0x0A, corrupting the stream. This has nothing to do with poor support; just be sure to open the file in binary mode, as you must with any library providing such functionality.
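To make the corruption concrete, here is a small sketch that imitates the translation in plain code rather than relying on any platform's text mode. Both function names are invented for the example, and the translation rule (every line feed byte 0x0A gains a preceding carriage return 0x0D) is the Windows-style behavior, assumed here for illustration.

```cpp
#include <string>

// Hypothetical helper: encodes one UTF-16 code unit as two little-endian bytes.
std::string utf16le_bytes(char16_t unit) {
    std::string bytes;
    bytes.push_back(static_cast<char>(unit & 0xFF));         // low byte first
    bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    return bytes;
}

// Hypothetical helper: imitates Windows-style text-mode output,
// where every 0x0A byte is written as 0x0D 0x0A.
std::string simulate_text_mode(const std::string& bytes) {
    std::string out;
    for (char c : bytes) {
        if (c == '\n') out.push_back('\r');  // inserted carriage return
        out.push_back(c);
    }
    return out;
}
```

For U+010A, utf16le_bytes yields 0A 01, and simulate_text_mode turns that into 0D 0A 01. The result is three bytes long, so every following 16-bit code unit is misaligned, which is the kind of damage binary mode avoids.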

Potatoswatter answered Nov 07 '22 10:11