How can I read a Unicode (UTF-8) file into wstring
(s) on the Windows platform?
The Difference Between Unicode and UTF-8 Unicode is a character set. UTF-8 is encoding. Unicode is a list of characters with unique decimal numbers (code points).
Use istreambuf_iterator to Read File Into String in C++ istreambuf_iterator is an input iterator that reads successive characters from the std::basic_streambuf object. Thus we can utilize istreambuf_iterator with an ifstream stream and read the whole contents of the file into a std::string .
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
With C++11 support, you can use std::codecvt_utf8 facet which encapsulates conversion between a UTF-8 encoded byte string and UCS2 or UCS4 character string and which can be used to read and write UTF-8 files, both text and binary.
In order to use facet you usually create locale object that encapsulates culture-specific information as a set of facets that collectively define a specific localized environment. Once you have a locale object, you can imbue your stream buffer with it:
#include <sstream>
#include <fstream>
#include <codecvt>
std::wstring readFile(const char* filename)
{
std::wifstream wif(filename);
wif.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
std::wstringstream wss;
wss << wif.rdbuf();
return wss.str();
}
which can be used like this:
std::wstring wstr = readFile("a.txt");
Alternatively you can set the global C++ locale before you work with string streams which causes all future calls to the std::locale
default constructor to return a copy of the global C++ locale (you don't need to explicitly imbue stream buffers with it then):
std::locale::global(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
According to a comment by @Hans Passant, the simplest way is to use _wfopen_s. Open the file with mode rt, ccs=UTF-8
.
Here is another pure C++ solution that works at least with VC++ 2010:
#include <locale>
#include <codecvt>
#include <string>
#include <fstream>
#include <cstdlib>
int main() {
const std::locale empty_locale = std::locale::empty();
typedef std::codecvt_utf8<wchar_t> converter_type;
const converter_type* converter = new converter_type;
const std::locale utf8_locale = std::locale(empty_locale, converter);
std::wifstream stream(L"test.txt");
stream.imbue(utf8_locale);
std::wstring line;
std::getline(stream, line);
std::system("pause");
}
Except for locale::empty()
(here locale::global()
might work as well) and the wchar_t*
overload of the basic_ifstream
constructor, this should even be pretty standard-compliant (where “standard” means C++0x, of course).
Here's a platform-specific function for Windows only:
size_t GetSizeOfFile(const std::wstring& path)
{
struct _stat fileinfo;
_wstat(path.c_str(), &fileinfo);
return fileinfo.st_size;
}
std::wstring LoadUtf8FileToString(const std::wstring& filename)
{
std::wstring buffer; // stores file contents
FILE* f = _wfopen(filename.c_str(), L"rtS, ccs=UTF-8");
// Failed to open file
if (f == NULL)
{
// ...handle some error...
return buffer;
}
size_t filesize = GetSizeOfFile(filename);
// Read entire file contents in to memory
if (filesize > 0)
{
buffer.resize(filesize);
size_t wchars_read = fread(&(buffer.front()), sizeof(wchar_t), filesize, f);
buffer.resize(wchars_read);
buffer.shrink_to_fit();
}
fclose(f);
return buffer;
}
Use like so:
std::wstring mytext = LoadUtf8FileToString(L"C:\\MyUtf8File.txt");
Note the entire file is loaded in to memory, so you might not want to use it for very large files.
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <cstdlib>
int main()
{
std::wifstream wif("filename.txt");
wif.imbue(std::locale("zh_CN.UTF-8"));
std::wcout.imbue(std::locale("zh_CN.UTF-8"));
std::wcout << wif.rdbuf();
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With