When I read a text file into a wide character string (std::wstring) using a wifstream, does the stream implementation support different encodings, i.e. can it be used to read e.g. ASCII, UTF-8, and UTF-16 files?
If not, what would I have to do?
(I need to read the entire file, if that makes a difference)
C++ supports character encodings by means of std::locale and the facet std::codecvt. The general idea is that a locale object describes the aspects of the system that might vary from culture to culture and from (human) language to language. These aspects are broken down into facets, which are classes installed in a locale that define how localization-dependent objects (including I/O streams) behave. When you read from an istream or write to an ostream, the actual reading or writing of each character is filtered through the locale's facets. The facets cover not only the encoding of Unicode types but also such varied features as how large numbers are written (e.g. with commas or periods), currency, time, capitalization, and a slew of other details.
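To make the facet mechanism concrete, here is a minimal sketch that installs a custom std::numpunct facet into a stream's locale so that large numbers are written with commas. The names comma_grouping and format_with_commas are hypothetical, chosen only for this illustration.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: groups digits in threes with commas.
struct comma_grouping : std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats a number through a stream imbued with the custom facet.
std::string format_with_commas(long value) {
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(out.getloc(), new comma_grouping));
    out << value;
    return out.str();
}
```

Calling format_with_commas(1234567) should yield "1,234,567". A codecvt facet is installed the same way; it just governs byte-to-character conversion instead of number punctuation.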
However, just because the facilities exist to do encodings doesn't mean the standard library actually handles all encodings, nor does it make such code simple to get right. Even something as basic as the size of the character type you should read into (let alone the encoding part) is difficult: wchar_t can be too small (mangling your data) or too large (wasting space), and the most common compilers differ on its size (16 bits in Visual C++, 32 bits in GNU C++ on Linux). So you generally need to find external libraries to do the actual encoding.
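A quick way to see the wchar_t problem on your own platform is to ask the compiler. The sketch below uses two hypothetical helpers, wchar_bits and fits_in_one_wchar, to report the width of wchar_t and whether a given Unicode code point fits in a single wchar_t.

```cpp
#include <climits>
#include <limits>

// Number of bits in wchar_t on the current platform (commonly 16 or 32).
constexpr int wchar_bits() {
    return static_cast<int>(sizeof(wchar_t) * CHAR_BIT);
}

// True if a Unicode code point fits in one wchar_t without being split.
constexpr bool fits_in_one_wchar(char32_t code_point) {
    return static_cast<unsigned long long>(code_point) <=
           static_cast<unsigned long long>(std::numeric_limits<wchar_t>::max());
}
```

With a 16-bit wchar_t, any code point above U+FFFF (for example U+1F600) does not fit, which is where data gets mangled unless the code handles UTF-16 surrogate pairs.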
The most straightforward example I can find that covers all the bases is from Boost's UTF-8 codecvt facet, with an example that specifically converts between UTF-8 and UCS-4 for use by I/O streams. It looks like this, though I don't suggest just copying it verbatim. It takes a little more digging in the source to understand it (and I don't claim to):
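typedef wchar_t ucs4_t;  // holds full UCS-4 only where wchar_t is 32 bits
std::locale old_locale;
// The new locale takes ownership of the facet pointer.
std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);
...
std::wifstream input_file("data.utf8");
input_file.imbue(utf8_locale);
ucs4_t item = 0;
// Loop on the stream that was opened above (input_file).
while (input_file >> item) { ... }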
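Since the question asks for the whole file, here is a sketch of the same imbue pattern that reads everything into a std::wstring. It swaps the Boost facet for the standard library's std::codecvt_utf8, which is a different facet and is deprecated since C++17 (some compilers warn about it), so treat this as an illustration under that assumption rather than a recommendation. The function name read_utf8_file is hypothetical.

```cpp
#include <codecvt>    // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>

// Reads an entire UTF-8 encoded file into a wide string.
std::wstring read_utf8_file(const std::string& path) {
    std::wifstream input;
    // Imbue before opening so the facet is in place for the first read.
    input.imbue(std::locale(input.getloc(), new std::codecvt_utf8<wchar_t>));
    input.open(path, std::ios::binary);
    if (!input) {
        throw std::runtime_error("cannot open " + path);
    }
    std::wstringstream buffer;
    buffer << input.rdbuf();  // copy the decoded characters in one go
    return buffer.str();
}
```

Where wchar_t is 16 bits, this facet can only produce characters up to U+FFFF, so the size caveat from earlier still applies.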
To understand more about locales, and how they use facets (including codecvt), take a look at the following:
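ifstream does not care about the encoding of a file. It just reads chars (bytes) from it. wifstream yields wide characters (wchar_t), but it still doesn't know anything about the file's encoding: the bytes pass through the codecvt facet of the stream's locale, and the default "C" locale performs no UTF-8 or UTF-16 decoding. A 16-bit wchar_t is good enough to hold UCS-2, a fixed-length character encoding for Unicode (each character is represented with two bytes).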
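Because the stream itself knows nothing about the encoding, one common heuristic is to read the raw bytes and look for a byte order mark (BOM) before deciding how to decode. The sketch below is only that heuristic: Bom, read_all_bytes and detect_bom are hypothetical names, and a file without a BOM may still be ASCII, UTF-8, or something else entirely.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

enum class Bom { none, utf8, utf16_le, utf16_be };

// Reads the whole file as raw bytes; no decoding takes place here.
std::string read_all_bytes(const std::string& path) {
    std::ifstream input(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(input),
                       std::istreambuf_iterator<char>());
}

// Inspects the first bytes for a byte order mark.
Bom detect_bom(const std::string& bytes) {
    auto byte = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF)
        return Bom::utf8;
    if (bytes.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE)
        return Bom::utf16_le;  // note: UTF-32 LE starts with these bytes too
    if (bytes.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF)
        return Bom::utf16_be;
    return Bom::none;
}
```

A typical use would be detect_bom(read_all_bytes(path)), followed by picking a decoder (a codecvt facet or a library such as ICU) based on the result.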
You could use IBM's ICU library to deal with Unicode files.
The International Components for Unicode (ICU) is a mature, portable set of C/C++ and Java libraries for Unicode support, software internationalization (I18N) and globalization (G11N), giving applications the same results on all platforms.
ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software.