How to read utf-16 file into utf-8 std::string line by line

I'm working with code that expects utf8-encoded std::string variables. I want to be able to handle a user-supplied file that potentially has utf-16 encoding (I don't know the encoding at design time, but eventually want to be able to deal with utf8/16/32), read it line-by-line, and forward each line to the rest of the code as a utf8-encoded std::string.

I have c++11 (really, the current MSVC subset of c++11) and boost 1.55.0 to work with. I'll need the code to work on both Linux and Windows variants eventually. For now, I'm just prototyping on Windows with Visual Studio 2013 Update 4, running on Windows 7. I'm open to additional dependencies, but they'd need to have an established cross-platform (meaning windows and *nix) track record, and shouldn't be GPL/LGPL.

I've been making assumptions that I don't seem to be able to find a way to validate, and I have code that is not working.

One assumption is that, since I ultimately want each line from these files in a std::string variable, I should be working with std::ifstream imbued with a properly-constructed codecvt such that the incoming utf16 stream can be converted to utf8.

Is this assumption realistic? The alternative, I thought, would be that I'd have to do some encoding checks on the text file, and then choose wifstream/wstring or ifstream/string based on the results, which seemed more unappealing than I'd like to start with. Of course, if that's the right (or the only realistic) path, I'm open to it.

I realize that I may likely need to do some encoding detection anyway, but for now, I am not so concerned about the encoding detection part, just focusing on getting utf16 file contents into utf8 std::string.

I have tried a variety of different combinations of locale and codecvt, none of which have worked. Below is the latest incarnation of what I thought might work, but doesn't:

void
SomeRandomClass::readUtf16LeFile( const std::string& theFileName )
{
    boost::locale::generator gen;
    std::ifstream file( theFileName );
    auto utf8Locale = gen.generate( "UTF-8" );
    std::locale cvtLocale( utf8Locale,
                           new std::codecvt_utf8_utf16<char>() );

    file.imbue( utf8Locale );
    std::string line;

    std::cout.imbue( utf8Locale );
    for ( int i = 0; i < 3; i++ )
    {
        std::getline( file, line );
        std::cout << line << std::endl;
    }
}

The behavior I see with this code is that the result of each call to getline() is an empty string, regardless of the file contents.

This same code works fine (meaning, each getline() call returns a correctly-encoded non-empty string) on a utf8-encoded version of the same file if I omit the locale-related lines of the above method (the gen.generate() call and the file.imbue() call).

For whatever reason, I could not find any examples, here on SO, on http://en.cppreference.com/, or elsewhere in the wild, of anyone trying to do this same thing.

All ideas/suggestions (conformant to requirements above) welcome.

Hoobajoob asked Mar 12 '15

1 Answer

Reading UTF-16, writing UTF-8

The first question to clarify is which variant of UTF-16 you are reading:

  • is it UTF-16LE (i.e. generated under Windows)?
  • is it UTF-16BE (generated by wstream by default)?
  • is it UTF-16 with a BOM?

The next question is whether you can actually output UTF-8 or UTF-16 on the console, knowing that the default Windows console can cause real headaches there.

Step 1: Make sure that the problem is not related to the Windows console

So here is a small piece of code that reads a UTF-16LE file and checks the content with a native Windows function (you just have to include <windows.h> in your console app):

    // assumes #include <fstream>, <string>, <codecvt> and using namespace std;
    wifstream is16(filename);
    is16.imbue(locale(is16.getloc(), new codecvt_utf16<wchar_t, 0x10ffff, little_endian>()));
    wstring wtext, wline;
    for (int i = 0; getline(is16, wline); i++)
        wtext += wline + L"\n";
    MessageBoxW(NULL, wtext.c_str(), L"UTF16-Little Endian", MB_OK);

If your file is UTF-16 with a BOM, just replace little_endian with consume_header.

Step 2: Convert your UTF-16 string into a UTF-8 string

You have to use a string converter:

    wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> converter;

    wifstream is16(filename);
    is16.imbue(locale(is16.getloc(), new codecvt_utf16<wchar_t, 0x10ffff, little_endian>()));
    wstring wline;
    string u8line;
    for (int i = 0; i < 10 && getline(is16, wline); i++) {
        u8line = converter.to_bytes(wline);
        cout << u8line << endl;
    }

This will show the ASCII characters correctly on the Windows console. However, all the multi-byte UTF-8 sequences will appear as garbage (unless you're more successful than I was at getting the console to display a Unicode font).

Step 3: check the utf8 encoding using a file

As the Windows console is pretty bad at this, the best approach is to write the text you produced directly to a file and open that file with a text editor (like Notepad++) which can show you the encoding.

Nota bene: all of this was done using only the standard library (except for the intermediary MessageBoxW()) and its locales.

Further steps

If you want to detect the encoding, the first thing to do is to see if there is a BOM at the very beginning of your file (opened for binary input, default "C" locale):

unsigned char bom_utf8[]    { 0xEF, 0xBB, 0xBF };
unsigned char bom_utf16be[] { 0xFE, 0xFF };
unsigned char bom_utf16le[] { 0xFF, 0xFE };
unsigned char bom_utf32be[] { 0, 0, 0xFE, 0xFF };
unsigned char bom_utf32le[] { 0xFF, 0xFE, 0, 0 };

Just load the first few bytes, and compare with this data.

If you find one, you're done. If not, you'll have to scan through the file.

A quick approximation, if you expect western languages, is the following: if you find a lot of null bytes (more than 25% but less than 50%), it's probably UTF-16. If more than 50% of the bytes are null, it's probably UTF-32.

But a more precise approach can make sense. For instance, to verify whether the file is UTF-16, you just have to implement a small state machine that checks that any time a word has its high byte between 0xD8 and 0xDB, the next word has its high byte between 0xDC and 0xDF. Which byte is the high one depends, of course, on whether it's little or big endian.

For UTF-8 it's a similar approach, but the state machine is a little more complex, because the bit pattern of the first byte defines how many continuation bytes must follow, and each of the continuation bytes must match the bit pattern (c & 0xC0) == 0x80.

Christophe answered Sep 20 '22