While trying to read a UTF-16 encoded file with hints from this answer, I ran into a problem: after reading a few thousand characters, std::getline starts returning mojibake.
Here is my main:
#include <cstdio>
#include <fstream>
#include <iostream>
#include <codecvt>
#include <locale>

int main(void) {
    std::wifstream wif("test.txt", std::ios::binary);
    setlocale(LC_ALL, "en_US.utf8");
    if (wif.is_open())
    {
        wif.imbue(
            std::locale(
                wif.getloc(),
                new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>
            )
        );
        std::wstring wline;
        while (std::getline(wif, wline))
        {
            std::wcout << wline;
        }
        wif.close();
    }
    return 0;
}
The test.txt file contains the FF, FE byte order mark, followed by 100 lines with 80 'a's in each line. Here is a bash script that generates test.txt on *nix:
#!/bin/bash
echo -n -e \\xFF\\xFE > test.txt
for i in $(seq 1 100)
do
    for i in $(seq 1 80)
    do
        echo -n -e \\x61\\x00 >> test.txt
    done
    echo -n -e \\x0A\\x00 >> test.txt
done
Here is how I compile and run the main:
g++-8 -std=c++17 -g main.cpp -o m && ./m
What I expected: 8000 'a's are printed.
What actually happened:
After printing a few thousand a's, the output changes to the following garbage:
aaaaaaaaaa愀愀愀愀愀愀愀愀愀愀
and occasionally non-printable characters that look like 0A00 in a rectangle.
The 愀 character has the binary code point value 110000100000000 (U+6100), so it looks like an 'a' byte (0x61) followed by a 0 byte.
It seems as if some bytes are lost during reading, and from then on everything is misaligned and all the remaining symbols are decoded incorrectly. Or, because the output ends with a 0A00 thingie, it might be that the endianness is reversed after reading a few thousand a's, but this behavior also wouldn't make any sense whatsoever.
Why does this happen, and what's the easiest way to fix it?
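The observation about 愀 and 0A00 can be checked by hand. The sketch below is only an illustration (decode_utf16_unit and demonstrate_misdecoding are hypothetical names, not part of any library): it combines the same byte pairs from test.txt once in little-endian and once in big-endian order. It shows what the garbage characters are made of; it does not by itself tell lost bytes apart from a flipped byte order, since shifting the stream by one byte produces the same pairs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: combine two consecutive bytes into one UTF-16 code unit.
std::uint16_t decode_utf16_unit(unsigned char first, unsigned char second,
                                bool little_endian) {
    return little_endian
        ? static_cast<std::uint16_t>(first | (second << 8))
        : static_cast<std::uint16_t>((first << 8) | second);
}

// Hypothetical demonstration of the two readings of the bytes in test.txt.
void demonstrate_misdecoding() {
    // Bytes 61 00 read as little-endian give U+0061, the letter 'a'.
    assert(decode_utf16_unit(0x61, 0x00, true) == 0x0061);
    // The same bytes read as big-endian give U+6100, the character 愀.
    assert(decode_utf16_unit(0x61, 0x00, false) == 0x6100);
    // The line terminator 0A 00 turns from U+000A into U+0A00.
    assert(decode_utf16_unit(0x0A, 0x00, true) == 0x000A);
    assert(decode_utf16_unit(0x0A, 0x00, false) == 0x0A00);
}
```

Calling demonstrate_misdecoding() from a main passes all four checks, which matches the 愀 characters and the 0A00 boxes seen in the output.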
A simple workaround (but not a general solution)
If you are sure that the input file will have a particular endianness, then you can simply hardcode the endianness as shown in the example in the documentation:
wif.imbue(
    std::locale(
        wif.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>
    )
);
With a hardcoded std::little_endian, the problem seems to disappear, and the file is read correctly. It probably won't work for files with the opposite endianness.
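One possible way to cover both byte orders, sketched here under assumptions rather than offered as a definitive fix, is to peek at the first two bytes yourself and then hardcode whichever endianness the BOM indicates. The names Utf16Order, detect_utf16_bom and read_utf16_file are hypothetical helpers invented for this example, and treating a file without a BOM as big-endian is an assumption (it is simply the default of std::codecvt_utf16). Because std::consume_header is not used, the BOM arrives as the character U+FEFF and is stripped by hand.

```cpp
#include <cassert>
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

enum class Utf16Order { Little, Big, Unknown };

// Hypothetical helper: inspect the first two bytes for a UTF-16 BOM.
Utf16Order detect_utf16_bom(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    char bom[2] = {0, 0};
    if (!in.read(bom, 2)) return Utf16Order::Unknown;
    const unsigned char b0 = static_cast<unsigned char>(bom[0]);
    const unsigned char b1 = static_cast<unsigned char>(bom[1]);
    if (b0 == 0xFF && b1 == 0xFE) return Utf16Order::Little;
    if (b0 == 0xFE && b1 == 0xFF) return Utf16Order::Big;
    return Utf16Order::Unknown;
}

// Hypothetical helper: read the whole file with the byte order fixed up front.
std::wstring read_utf16_file(const std::string& path) {
    assert(!path.empty());  // precondition: a file name must be given
    const Utf16Order order = detect_utf16_bom(path);

    std::wifstream wif;
    if (order == Utf16Order::Little) {
        wif.imbue(std::locale(wif.getloc(),
            new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));
    } else {
        // Assumption: a big-endian BOM or no BOM at all means big-endian,
        // which is what codecvt_utf16 uses when no mode flag is given.
        wif.imbue(std::locale(wif.getloc(),
            new std::codecvt_utf16<wchar_t, 0x10ffff>));
    }
    wif.open(path, std::ios::binary);  // imbue first, then open

    std::wstring text, line;
    while (std::getline(wif, line)) {
        text += line;
        text += L'\n';  // getline drops the terminator; put it back
    }
    // Without consume_header the BOM is decoded as U+FEFF; remove it.
    if (!text.empty() && text.front() == static_cast<wchar_t>(0xFEFF)) {
        text.erase(0, 1);
    }
    return text;
}
```

With the test.txt from the question, read_utf16_file("test.txt") should return the 8000 a's plus the line breaks, and the result can be printed through std::wcout as in the original main. Note that this sketch appends a line break after the last line even if the file does not end with one.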