I am trying to use the std::locale mechanism in C++11 to count words in different languages. Specifically, I have a std::wstringstream which contains the title of a famous Russian novel ("Crime and Punishment" in English). What I want to do is use the appropriate locale (ru_RU.utf8 on my Linux machine) to read the stringstream, count the words, and print the results. I should also probably note that my system is set to use the en_US.utf8 locale.
The desired result is this:
0: "Преступление" 1: "и" 2: "наказание" I counted 3 words. and the last word was "наказание"
That all works when I set the global locale, but not when I attempt to imbue the wcout stream. When I try that, I get this result instead:

0: "????????????"
1: "?"
2: "?????????"

I counted 3 words.
and the last word was "?????????"
Also, when I attempt to use a solution suggested in the comments (which can be activated by changing #define USE_CODECVT 0 to #define USE_CODECVT 1), I get the error mentioned in this other question.
Those interested in experimenting with the code, or with compiler settings or both may wish to use this live code.
My questions:

Why does the attempt to imbue fail? Is it because wcout is already open?
Is there some way to use imbue rather than setting the global locale to do what I want?

If it makes a difference, I'm using g++ 4.8.3. The full code is shown below.
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <locale>

#define USE_CODECVT 0
#define USE_IMBUE 1

#if USE_CODECVT
#include <codecvt>
#endif

using namespace std;

int main() {
#if USE_CODECVT
    locale ru("ru_RU.utf8",
            new codecvt_utf8<wchar_t, 0x10ffff, consume_header>{});
#else
    locale ru("ru_RU.utf8");
#endif
#if USE_IMBUE
    wcout.imbue(ru);
#else
    locale::global(ru);
#endif
    wstringstream in{L"Преступление и наказание"};
    in.imbue(ru);
    wstring word;
    unsigned wordcount = 0;
    while (in >> word) {
        wcout << wordcount << ": \"" << word << "\"\n";
        ++wordcount;
    }
    wcout << "\nI counted " << wordcount << " words.\n"
          << "and the last word was \"" << word << "\"\n";
}
std::locale

An object of class std::locale is an immutable indexed set of immutable facets. Each stream object of the C++ input/output library is associated with a std::locale object and uses its facets for parsing and formatting of all data.

In C++, a locale is a class called locale provided by the C++ Standard Library. The C++ class locale differs from the C locale because it is more than a language table, or a data representation of the various culture and language dependencies.
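As an illustration of the facet mechanism, here is a minimal sketch (assuming the ru_RU.utf8 locale is installed on the system):

#include <iostream>
#include <locale>

int main() {
    std::locale ru("ru_RU.utf8");  // throws std::runtime_error if not installed
    // A locale is an indexed set of facets; use_facet retrieves one by type.
    const std::numpunct<wchar_t>& np = std::use_facet<std::numpunct<wchar_t>>(ru);
    std::wcout << L"ru_RU decimal point: " << np.decimal_point() << L"\n";  // ','
}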
First I did some more tests using your code, and I can confirm that L"Преступление и наказание" is a correctly encoded wide string. I checked the codes of the individual characters, and they are, correctly: 0x41f, 0x440, 0x435, 0x441, 0x442, 0x443, 0x43f, 0x43b, 0x435, 0x43d, 0x438, 0x435, 0x20, 0x438, 0x20, 0x43d, 0x430, 0x43a, 0x430, 0x437, 0x430, 0x43d, 0x438, 0x435.
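(For reference, one way to dump those code points yourself; a quick illustrative sketch:)

#include <iostream>
#include <string>

int main() {
    // Print the code point of each wchar_t in the literal, in hex.
    for (wchar_t c : std::wstring(L"Преступление и наказание"))
        std::wcout << L"0x" << std::hex << static_cast<unsigned long>(c) << L" ";
    std::wcout << L"\n";
}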
I could not find any reference about it, but it looks like simply calling imbue is not enough. imbue is a method of basic_ios, which is an ancestor of cout and wcout. It does act on numeric conversions, but in all my tests it had no effect on the charset used for output.
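A sketch of that observation (assuming ru_RU.utf8 is installed; the exact fallback output may vary by platform):

#include <iostream>
#include <locale>

int main() {
    std::wcout.imbue(std::locale("ru_RU.utf8"));
    // The imbued locale does drive numeric formatting: ru_RU uses a decimal comma.
    std::wcout << 3.14 << L"\n";    // prints "3,14"
    // But the charset used to reach the terminal still comes from the C locale,
    // so without setlocale the Cyrillic text below may come out as "?????????".
    std::wcout << L"наказание" << L"\n";
}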
By default, the locale used in a C++ (or C) program is ... the C locale, which knows nothing about Unicode. All printable ASCII characters (below 128) are output as-is, and the others are replaced with a ?. That is exactly what your program does.
To make it work correctly, you have to select a locale that knows about Unicode characters with setlocale. Once this is done, you can change the numeric conversion by calling imbue, and since you have selected a Unicode charset, all will be fine.
So provided your current locale uses a UTF-8 charset, you only have to add

setlocale(LC_ALL, "");

as the first line of your program, and the output will be as expected:
0: "Преступление" 1: "и" 2: "наказание" I counted 3 words. and the last word was "наказание"
If your current locale does not use UTF-8, choose one that is installed on your system and that supports it. I used setlocale(LC_ALL, "fr_FR.UTF-8"); and even setlocale(LC_ALL, "en_US.UTF-8");, and both worked.
Edit:

In fact, the best way to correctly output Unicode to the screen is to use setlocale(LC_ALL, "");. It automatically adapts to the current charset. I tested with a stripped-down variant using the Latin-1 charset (my system natively speaks French, not Russian ...):
#include <clocale>
#include <iostream>
#include <locale>

using namespace std;

int main() {
    setlocale(LC_ALL, "");             // adapt to the environment's charset
    wchar_t ws[] = { 0xe8, 0xe9, 0 };  // the code points for "èé"
    wcout << ws << endl;
}
I tried it under Linux using the UTF-8 charset and ISO-8859-1 (Latin-1) (resp. export LANG=fr_FR.UTF-8 and export LANG=fr_FR.ISO-8859-1), and I correctly got èé in the proper charset. I also tried it under Windows XP, with code page 850 (OEM) and 1252 (ANSI) (resp. chcp 850 and chcp 1252, with the Lucida Console font), and got èé on the console too.
Edit 2:
Of course, you can also set a global C++ locale with locale::global(locale("")); for the default locale, or locale::global(locale("ru_RU.UTF-8")); for the Russian locale, but that is more than simply calling setlocale. According to the documentation of the GNU implementation of the C++ Standard Library about locales, "there is only one relation (of the C++ locale mechanism) to the C locale mechanism: the global C locale is modified if a named C++ locale object is set as the global locale", that is, std::locale::global(std::locale("")); affects the C functions as if the following call was made: std::setlocale(LC_ALL, "");. On the other hand, there is no vice versa: calling setlocale has no effect whatsoever on the C++ locale mechanism, in particular on the working of locale("").
So it really looks like there is an underlying C library mechanism that must first be enabled with setlocale for the imbue conversion to work correctly.
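A small sketch of that relationship (the printed locale name depends on your environment):

#include <clocale>
#include <iostream>
#include <locale>

int main() {
    // Setting a named C++ global locale also modifies the global C locale...
    std::locale::global(std::locale(""));
    // ...which we can observe by querying the C side (nullptr just reads it).
    std::wcout << L"C locale is now: " << std::setlocale(LC_ALL, nullptr) << L"\n";
}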