
Read Unicode UTF-8 file into wstring

How can I read a Unicode (UTF-8) file into wstring(s) on the Windows platform?

Abdelwahed asked Jan 23 '11 18:01



4 Answers

With C++11 support, you can use the std::codecvt_utf8 facet, which converts between a UTF-8 encoded byte string and a UCS-2 or UCS-4 character string. It can be used to read and write UTF-8 files, both text and binary. Note that the <codecvt> header has been deprecated since C++17.

To use a facet, you usually create a locale object, which encapsulates culture-specific information as a set of facets that collectively define a specific localized environment. Once you have a locale object, you can imbue your stream with it:

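#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename);
    // std::locale::empty() is an MSVC extension; the locale takes ownership of the facet
    wif.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf();   // copy the decoded file contents into the string stream
    return wss.str();
}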

which can be used like this:

std::wstring wstr = readFile("a.txt");
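
The same facet can also convert UTF-8 text that is already in memory, for example bytes read with a plain std::ifstream. Here is a minimal sketch, assuming std::wstring_convert (declared in <locale>, deprecated since C++17 together with <codecvt>) is still available in your toolchain. The function name utf8ToWide is made up for illustration:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: converts a UTF-8 encoded byte string to a wide string.
// from_bytes() throws std::range_error if the input cannot be converted.
std::wstring utf8ToWide(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(utf8);
}
```

It could be called as std::wstring w = utf8ToWide(bytes); where bytes holds the raw file contents.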

Alternatively, you can set the global C++ locale before you create any streams. All later calls to the std::locale default constructor then return a copy of that global locale, so you don't need to imbue each stream explicitly:

std::locale::global(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
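
As a rough sketch of how the global-locale approach might look end to end: the version below uses std::locale() as the base locale instead of std::locale::empty(), which is an assumption made so that it also compiles outside MSVC. The function names useUtf8Globally and readFileWithGlobalLocale are hypothetical.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Installs a global locale whose codecvt facet decodes UTF-8.
// std::locale() (a copy of the current global locale) serves as the base.
void useUtf8Globally()
{
    std::locale::global(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
}

// Reads a whole file into a wide string. No imbue() call is made here:
// streams constructed after useUtf8Globally() pick up the global locale.
std::wstring readFileWithGlobalLocale(const char* filename)
{
    std::wifstream wif(filename);
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}
```

Call useUtf8Globally() once at startup, before any wide file streams are constructed. Streams that already exist keep the locale they were created with.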
LihO answered Oct 06 '22 01:10


According to a comment by @Hans Passant, the simplest way is to use _wfopen_s and open the file with the mode "rt, ccs=UTF-8".

Here is another pure C++ solution that works at least with VC++ 2010:

#include <locale>
#include <codecvt>
#include <string>
#include <fstream>
#include <cstdlib>

int main() {
    const std::locale empty_locale = std::locale::empty();
    typedef std::codecvt_utf8<wchar_t> converter_type;
    const converter_type* converter = new converter_type;
    const std::locale utf8_locale = std::locale(empty_locale, converter);
    std::wifstream stream(L"test.txt");
    stream.imbue(utf8_locale);
    std::wstring line;
    std::getline(stream, line);
    std::system("pause");
}

Except for locale::empty() (here locale::global() might work as well) and the wchar_t* overload of the basic_ifstream constructor, this should even be pretty standard-compliant (where “standard” means C++0x, of course).
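
For comparison, here is a sketch that avoids both non-standard pieces by using std::locale::classic() as the base locale and a narrow file name. It also reads every line rather than just the first. The function name readUtf8Lines is hypothetical, and the swap to locale::classic() is an assumption, not something tested with VC++ 2010:

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: reads all lines of a UTF-8 file into wide strings.
std::vector<std::wstring> readUtf8Lines(const char* filename)
{
    std::wifstream stream;
    // Imbue before opening so the facet is in place before any bytes are read.
    stream.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8<wchar_t>));
    stream.open(filename);

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(stream, line))
        lines.push_back(line);
    return lines;   // empty if the file could not be opened
}
```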

Philipp answered Oct 06 '22 00:10


Here's a platform-specific function for Windows only:

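#include <sys/types.h>
#include <sys/stat.h>
#include <cstdio>
#include <string>

size_t GetSizeOfFile(const std::wstring& path)
{
    struct _stat fileinfo;
    // _wstat returns 0 on success; treat any failure as an empty file
    if (_wstat(path.c_str(), &fileinfo) != 0)
        return 0;
    return fileinfo.st_size;
}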

std::wstring LoadUtf8FileToString(const std::wstring& filename)
{
    std::wstring buffer;            // stores file contents
    FILE* f = _wfopen(filename.c_str(), L"rtS, ccs=UTF-8");

    // Failed to open file
    if (f == NULL)
    {
        // ...handle some error...
        return buffer;
    }

    size_t filesize = GetSizeOfFile(filename);

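    // Read entire file contents into memory. The size in bytes is an upper
    // bound on the number of wchar_t values the decoded text can occupy.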
    if (filesize > 0)
    {
        buffer.resize(filesize);
        size_t wchars_read = fread(&(buffer.front()), sizeof(wchar_t), filesize, f);
        buffer.resize(wchars_read);
        buffer.shrink_to_fit();
    }

    fclose(f);

    return buffer;
}

Use like so:

std::wstring mytext = LoadUtf8FileToString(L"C:\\MyUtf8File.txt");

Note that the entire file is loaded into memory, so you might not want to use this for very large files.

AshleysBrain answered Oct 06 '22 01:10


#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <cstdlib>

int main()
{
    std::wifstream wif("filename.txt");
    wif.imbue(std::locale("zh_CN.UTF-8"));

    std::wcout.imbue(std::locale("zh_CN.UTF-8"));
    std::wcout << wif.rdbuf();
}

Shen Yu answered Oct 06 '22 01:10