
Wrong CRLF in UTF-16 stream?

Here is a problem I could not solve despite all my efforts, so I am totally stuck. Please help!

In regular “ASCII” mode, the following simplified file and stream outputs

FILE *fa = fopen("utfOutFA.txt", "w");
fprintf(fa, "Line1\nLine2");
fclose(fa);
ofstream sa("utfOutSA.txt");
sa << "Line1\nLine2";
sa.close();

result, naturally, in exactly the same text files (hex dump):

00000000h: 4C 69 6E 65 31 0D 0A 4C 69 6E 65 32             ; Line1..Line2

where the new line \n is expanded to CRLF: 0D 0A – typical for Windows.
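
For anyone reproducing these dumps without a hex viewer, here is a minimal sketch of a helper that returns a file's raw bytes as hex text. The name `hex_dump` is made up for illustration. The file is read in binary mode on purpose: in text mode, the Windows runtime would collapse 0D 0A back to 0A while reading and hide the very bytes being inspected.

```cpp
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns the bytes of `path` as space-separated
// two-digit uppercase hex values, e.g. "4C 69 6E".
std::string hex_dump(const std::string& path) {
    // Binary mode: no newline translation on input.
    std::ifstream in(path, std::ios::binary);
    std::string out;
    char buf[4];
    for (std::istreambuf_iterator<char> it(in), end; it != end; ++it) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned value = static_cast<unsigned char>(*it);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Printing `hex_dump("utfOutFA.txt")` and `hex_dump("utfOutSA.txt")` side by side makes it easy to compare the two outputs byte for byte.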

Now we do the same for Unicode output, namely UTF-16 LE, which is a sort of “default”. File output

FILE *fu = fopen("utfOutFU.txt", "w, ccs=UNICODE");
fwprintf(fu, L"Line1\nLine2");
fclose(fu);

results in these contents:

00000000h: FF FE 4C 00 69 00 6E 00 65 00 31 00 0D 00 0A 00 ; ÿþL.i.n.e.1.....
00000010h: 4C 00 69 00 6E 00 65 00 32 00                   ; L.i.n.e.2.

which looks perfectly correct considering BOM and endianness, including CRLF: 0D 00 0A 00. However, the similar stream output

wofstream su("utfOutSU.txt");
su.imbue(locale(locale::empty(), new codecvt_utf16<wchar_t, 0x10ffffUL, 
                            codecvt_mode(generate_header + little_endian)>));
su << L"Line1\nLine2";
su.close();

results in a text file that is one byte shorter and, overall, incorrect:

00000000h: FF FE 4C 00 69 00 6E 00 65 00 31 00 0D 0A 00 4C ; ÿþL.i.n.e.1....L
00000010h: 00 69 00 6E 00 65 00 32 00                      ; .i.n.e.2.

The cause is a wrong CRLF expansion: 0D 0A 00 instead of 0D 00 0A 00. Is this a bug, or have I done something wrong?

I use the Microsoft Visual Studio compiler (14.0 and others). I tried the stream manipulator endl instead of \n, with the same result. I tried calling su.imbue() first and then su.open(), again with the same result. I also checked UTF-8 output (ccs=UTF-8 for the file and codecvt_utf8 for the stream): no problem there, as CRLF stays the same as in ASCII mode: 0D 0A.

I appreciate any ideas and comments on the issue.

asked Sep 19 '17 19:09 by lariona


1 Answer

When you are imbue()'ing a new locale into the std::wofstream, you are wiping out its original locale. Don't use locale::empty(), use su.getloc() instead, so the new locale copies the old locale before modifying it.

Also, on a side note, the last template parameter of codecvt_utf16 is a bitmask, so the flags really should be combined with | rather than +. The cast back to codecvt_mode is still needed, because | applied to two plain enumerators yields an int, and an int does not convert implicitly to the enum type of the template parameter.

su.imbue(std::locale(su.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffffUL,
                            std::codecvt_mode(std::generate_header | std::little_endian)>));
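
For a self-contained version to experiment with, here is a minimal sketch. It goes one step beyond the snippet above, and that step is an assumption rather than something stated in this answer: the 0D 0A 00 sequence in the question's dump is what one would see if the text-mode newline translation ran on the bytes after the UTF-16 conversion, so the sketch opens the stream with std::ios::binary and expects the caller to spell line breaks as L"\r\n". The names `write_utf16le` and `read_bytes` are hypothetical helpers.

```cpp
#include <codecvt>   // std::codecvt_utf16 (deprecated since C++17)
#include <fstream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: writes `text` as UTF-16 LE with a BOM.
// Assumption: binary mode, so no byte-level newline translation happens
// after the codecvt conversion; `text` must already contain L"\r\n".
bool write_utf16le(const std::string& path, const std::wstring& text) {
    using Utf16 = std::codecvt_utf16<wchar_t, 0x10ffffUL,
        std::codecvt_mode(std::generate_header | std::little_endian)>;

    std::wofstream su;
    // Copy the stream's current locale and replace only the codecvt facet.
    // The locale takes ownership of the facet allocated with new.
    su.imbue(std::locale(su.getloc(), new Utf16));
    su.open(path, std::ios::binary);
    if (!su) return false;
    su << text;
    su.close();
    return !su.fail();
}

// Hypothetical helper: reads a file back as raw bytes for inspection.
std::vector<unsigned char> read_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}
```

Called as write_utf16le("utfOutSU.txt", L"Line1\r\nLine2"), this is expected to produce the same bytes as the fwprintf dump in the question, with the line break encoded as 0D 00 0A 00. The trade-off is that \n is no longer expanded automatically, so every line break has to be written out explicitly.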
answered Nov 18 '22 16:11 by Remy Lebeau