
Why does `getline` on `wifstream` read garbled input from UTF-16 encoded file?

While trying to read a UTF-16 encoded file with hints from this answer, I ran into a problem: after reading a few thousand characters, getline starts returning mojibake.

Here is my main:

#include <clocale>
#include <fstream>
#include <iostream>
#include <codecvt>
#include <locale>

int main(void) {

    std::wifstream wif("test.txt", std::ios::binary);
    setlocale(LC_ALL, "en_US.utf8");
    if (wif.is_open())
    {
        wif.imbue(
            std::locale(
                wif.getloc(),
                new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>
            )
        );

        std::wstring wline;
        while (std::getline(wif, wline))
        {
            std::wcout << wline;
        }

        wif.close();
    } 

    return 0;
}

The test.txt file contains the FF FE byte order mark, followed by 100 lines of 80 'a's each, encoded as UTF-16LE. Here is a bash script that generates test.txt on *nix:

#!/bin/bash

echo -n -e \\xFF\\xFE > test.txt
for i in $(seq 1 100)
do
  for j in $(seq 1 80)
  do
    echo -n -e \\x61\\x00 >> test.txt
  done
  echo -n -e \\x0A\\x00 >> test.txt
done
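
For readers without bash, the same file can be produced from C++. This is only a sketch of the layout described above; the names make_utf16le_test_bytes and write_test_file are hypothetical helpers, not part of the original question.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: builds the bytes of the test file in memory.
// Layout: FF FE byte order mark, then `lines` lines of `width` 'a' characters,
// each line ended by '\n', every character stored as a UTF-16LE code unit.
std::string make_utf16le_test_bytes(std::size_t lines, std::size_t width) {
    std::string bytes = "\xFF\xFE";
    for (std::size_t l = 0; l < lines; ++l) {
        for (std::size_t c = 0; c < width; ++c) {
            bytes += 'a';   // low byte 0x61
            bytes += '\0';  // high byte 0x00
        }
        bytes += '\n';      // low byte 0x0A
        bytes += '\0';      // high byte 0x00
    }
    return bytes;
}

// Hypothetical helper: writes the bytes to `path` in binary mode.
// Returns false if the stream reports an I/O failure.
bool write_test_file(const std::string& path, std::size_t lines, std::size_t width) {
    std::ofstream out(path, std::ios::binary);
    const std::string bytes = make_utf16le_test_bytes(lines, width);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}
```

Calling write_test_file("test.txt", 100, 80) should give a file of 2 + 100 * 81 * 2 = 16202 bytes, which is a quick way to check that the generated input is what the reader is expected to see.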

Here is how I compile and run the main:

g++-8 -std=c++17 -g main.cpp -o m && ./m

What I expected: 8000 'a's are printed.

What actually happened:

After printing a few thousand 'a's, the output changes to the following garbage:

aaaaaaaaaa愀愀愀愀愀愀愀愀愀愀

and occasionally non-printable characters that look like 0A00 in a rectangle.

The 愀 character has the binary code point value 110000100000000 (U+6100), so it looks like the 'a' byte (0x61) followed by a zero byte.

It seems as if a byte is lost during reading, and from then on everything is misaligned and all remaining symbols are decoded incorrectly. Alternatively, because the output ends with a 0A00 character, it might be that the endianness is reversed after reading a few thousand 'a's, but that behavior would not make sense either.
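
The two hypotheses can be checked with a little arithmetic. The sketch below (utf16_le and utf16_be are hypothetical helper names) combines two bytes into one UTF-16 code unit in either byte order. It shows that a byte-order flip and a one-byte shift both turn the file's bytes into U+6100 and U+0A00, so the printed output alone cannot tell them apart.

```cpp
#include <cstdint>

// Combine two consecutive bytes into a UTF-16 code unit, low byte first.
constexpr std::uint16_t utf16_le(std::uint8_t first, std::uint8_t second) {
    return static_cast<std::uint16_t>(first | (second << 8));
}

// Combine two consecutive bytes into a UTF-16 code unit, high byte first.
constexpr std::uint16_t utf16_be(std::uint8_t first, std::uint8_t second) {
    return static_cast<std::uint16_t>((first << 8) | second);
}

// Correct decoding: the pair 61 00 read as little-endian is 'a'.
static_assert(utf16_le(0x61, 0x00) == 0x0061, "61 00 as LE is U+0061");

// Hypothesis 1: the byte order flips, so 61 00 is read as big-endian.
static_assert(utf16_be(0x61, 0x00) == 0x6100, "61 00 as BE is U+6100");
static_assert(utf16_be(0x0A, 0x00) == 0x0A00, "0A 00 as BE is U+0A00");

// Hypothesis 2: one byte is lost, so the pairs become 00 61 and 00 0A,
// still read as little-endian.
static_assert(utf16_le(0x00, 0x61) == 0x6100, "00 61 as LE is U+6100");
static_assert(utf16_le(0x00, 0x0A) == 0x0A00, "00 0A as LE is U+0A00");
```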

Why does this happen, and what's the easiest way to fix it?

Indestruktible asked Nov 06 '22 17:11

1 Answer

A simple workaround (but not a general solution)

If you are sure that the input file will have a particular endianness, then you can simply hardcode the endianness as shown in the example in the documentation:

        wif.imbue(
            std::locale(
                wif.getloc(),
                new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>
            )
        );

With a hardcoded std::little_endian, the problem seems to disappear, and the file is read correctly. Because the byte order is then fixed rather than taken from the BOM, this will not decode big-endian files correctly.
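
If both byte orders have to be supported, one option is to bypass the codecvt facet (std::codecvt_utf16 is deprecated since C++17 anyway) and decode the bytes by hand. The following is a minimal sketch under stated assumptions, not a drop-in replacement: decode_utf16 is a hypothetical helper, the whole file is assumed to fit in memory, and input without a BOM is assumed to be big-endian.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes a complete UTF-16 byte buffer to UTF-32.
// A leading FF FE selects little-endian and FE FF selects big-endian.
// Without a BOM, big-endian is assumed (an assumption of this sketch).
std::u32string decode_utf16(const std::string& bytes) {
    std::size_t pos = 0;
    bool little = false;
    if (bytes.size() >= 2) {
        const auto b0 = static_cast<unsigned char>(bytes[0]);
        const auto b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE) { little = true; pos = 2; }
        else if (b0 == 0xFE && b1 == 0xFF) { little = false; pos = 2; }
    }
    if ((bytes.size() - pos) % 2 != 0)
        throw std::runtime_error("odd number of bytes in UTF-16 input");

    // Reads the 16-bit code unit that starts at byte index i.
    auto unit = [&](std::size_t i) -> std::uint32_t {
        const std::uint32_t a = static_cast<unsigned char>(bytes[i]);
        const std::uint32_t b = static_cast<unsigned char>(bytes[i + 1]);
        return little ? (a | (b << 8)) : ((a << 8) | b);
    };

    std::u32string out;
    while (pos < bytes.size()) {
        std::uint32_t cp = unit(pos);
        pos += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate: needs a partner
            if (pos >= bytes.size())
                throw std::runtime_error("truncated surrogate pair");
            const std::uint32_t low = unit(pos);
            if (low < 0xDC00 || low > 0xDFFF)
                throw std::runtime_error("invalid low surrogate");
            pos += 2;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        }
        out.push_back(static_cast<char32_t>(cp));
    }
    return out;
}
```

To use it on a file, the bytes can be read with a std::ifstream opened in std::ios::binary mode into a std::string and passed to decode_utf16. Since the byte order is determined once from the BOM and kept in a local variable, it cannot change partway through the input.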

Indestruktible answered Nov 15 '22 06:11