How can I use std::imbue to set the locale for std::wcout?

I am trying to use the std::locale mechanism in C++11 to count words in different languages. Specifically, I have std::wstringstream which contains the title of a famous Russian novel ("Crime and Punishment" in English). What I want to do is to use the appropriate locale (ru_RU.utf8 on my Linux machine) to read the stringstream, count the words and print the results. I should also probably note that my system is set to use the en_US.utf8 locale.

The desired result is this:

    0: "Преступление"
    1: "и"
    2: "наказание"

    I counted 3 words.
    and the last word was "наказание"

That all works when I set the global locale, but not when I attempt to imbue the wcout stream. When I try that, I get this result instead:

    0: "????????????"
    1: "?"
    2: "?????????"

    I counted 3 words.
    and the last word was "?????????"

Also, when I attempt to use a solution suggested in the comments (which can be activated by changing #define USE_CODECVT 0 to #define USE_CODECVT 1), I get the error mentioned in this other question.

Those interested in experimenting with the code, or with compiler settings or both may wish to use this live code.

My questions

  1. Why does that not work? Is it because wcout is already open?
  2. Is there a way to use imbue rather than setting the global locale to do what I want?

If it makes a difference, I'm using g++ 4.8.3. The full code is shown below.

getwords.cpp

    #include <iostream>
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <locale>

    #define USE_CODECVT 0
    #define USE_IMBUE   1

    #if USE_CODECVT
    #include <codecvt>
    #endif

    using namespace std;

    int main()
    {
    #if USE_CODECVT
        locale ru("ru_RU.utf8",
            new codecvt_utf8<wchar_t, 0x10ffff, consume_header>{});
    #else
        locale ru("ru_RU.utf8");
    #endif
    #if USE_IMBUE
        wcout.imbue(ru);
    #else
        locale::global(ru);
    #endif
        wstringstream in{L"Преступление и наказание"};
        in.imbue(ru);
        wstring word;
        unsigned wordcount = 0;
        while (in >> word) {
            wcout << wordcount << ": \"" << word << "\"\n";
            ++wordcount;
        }
        wcout << "\nI counted " << wordcount << " words.\n"
            << "and the last word was \"" << word << "\"\n";
    }
Edward asked Oct 15 '14 16:10


1 Answer

First I did some more tests using your code, and I can confirm that L"Преступление и наказание" is a correct wide string. (On Linux wchar_t is 32 bits wide, so strictly speaking it is UTF-32 rather than UTF-16, but the two encodings coincide for these characters.) I checked the code points of the individual characters, and they are correctly 0x41f, 0x440, 0x435, 0x441, 0x442, 0x443, 0x43f, 0x43b, 0x435, 0x43d, 0x438, 0x435, 0x20, 0x438, 0x20, 0x43d, 0x430, 0x43a, 0x430, 0x437, 0x430, 0x43d, 0x438, 0x435.
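A check like this can be done with a few lines of code. The following is only a minimal sketch; the helper name code_points is made up for illustration and does not come from the original program.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the numeric value of every wchar_t in the
// string, so the values can be compared against the expected code points.
std::vector<unsigned long> code_points(const std::wstring& text) {
    std::vector<unsigned long> values;
    values.reserve(text.size());
    for (wchar_t wc : text) {
        values.push_back(static_cast<unsigned long>(wc));
    }
    return values;
}
```

Calling code_points(L"Преступление и наказание") and printing each value in hexadecimal should give the list above, assuming the compiler reads the source file as UTF-8.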

I could not find any reference about it, but it looks like simply calling imbue is not enough. imbue is a method of basic_ios, which is a base class of cout and wcout. It does act on numeric conversions, but in all my tests it has no effect on the charset used for output.
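The effect of imbue on numeric conversions can be seen without relying on any locale installed on the system. Here is a small sketch that uses a custom facet; SpaceGrouping and format_with_grouping are hypothetical names introduced only for this illustration.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: groups digits in threes, separated by a space.
struct SpaceGrouping : std::numpunct<wchar_t> {
protected:
    wchar_t do_thousands_sep() const override { return L' '; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats a number through a wide stream imbued with the facet above.
std::wstring format_with_grouping(long value) {
    std::wostringstream out;
    // The locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(std::locale::classic(), new SpaceGrouping));
    out << value;
    return out.str();
}
```

With this facet, format_with_grouping(1234567) yields L"1 234 567", which shows that the imbued locale is consulted for number formatting even though it does not change how characters are encoded on output.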

By default, the locale used in a C++ (or C) program is ... the C locale, which knows nothing about Unicode. All printable ASCII characters (below 128) are output as is, and the others are replaced with a ?. That is exactly what your program does.
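The same kind of substitution can be reproduced with the ctype facet of the classic "C" locale. This is only an analogy to illustrate the behavior, not necessarily the code path that wcout takes internally, and narrow_in_c_locale is a hypothetical helper name.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: narrows each wide character using the classic "C"
// locale, substituting '?' for anything without a one-byte equivalent.
std::string narrow_in_c_locale(const std::wstring& text) {
    const std::ctype<wchar_t>& facet =
        std::use_facet<std::ctype<wchar_t>>(std::locale::classic());
    std::string out;
    out.reserve(text.size());
    for (wchar_t wc : text) {
        out.push_back(facet.narrow(wc, '?'));
    }
    return out;
}
```

ASCII characters pass through unchanged, while each Cyrillic letter becomes a single question mark, which matches the shape of the output in the question.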

To make it work correctly, you have to select a locale that knows about unicode characters with setlocale. Once this is done, you can change the numeric conversion by calling imbue, and as you selected a unicode charset all will be fine.

So provided your current locale uses a UTF-8 charset, you only have to add

setlocale(LC_ALL, ""); 

as first line in your program, and the output will be as expected :

    0: "Преступление"
    1: "и"
    2: "наказание"

    I counted 3 words.
    and the last word was "наказание"

If your current locale does not use UTF-8, choose one that is installed on your system and that supports it. I used setlocale(LC_ALL, "fr_FR.UTF-8");, or even setlocale(LC_ALL, "en_US.UTF-8"); and both worked.
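Since the set of installed locales differs between systems, one option is to try several names in turn. The sketch below assumes that std::setlocale returns a null pointer for a name it cannot honor; first_available_locale is a hypothetical helper, not part of any library.

```cpp
#include <clocale>
#include <string>
#include <vector>

// Hypothetical helper: tries each candidate with std::setlocale and returns
// the first name the C library accepts, or an empty string if none works.
// Side effect: the accepted locale stays selected as the global C locale.
std::string first_available_locale(const std::vector<std::string>& candidates) {
    for (const std::string& name : candidates) {
        if (std::setlocale(LC_ALL, name.c_str()) != nullptr) {
            return name;
        }
    }
    return std::string();
}
```

For example, first_available_locale({"ru_RU.UTF-8", "fr_FR.UTF-8", "en_US.UTF-8"}) would select whichever of these is installed first, and an empty result signals that none of them is available.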

Edit :

In fact, the best way to correctly output Unicode to the screen is to use setlocale(LC_ALL, "");. It automatically adapts to the current charset. I tested with a stripped-down variant using the Latin-1 charset (my system is natively French, not Russian ...)

    #include <clocale>   // setlocale
    #include <iostream>
    #include <locale>

    using namespace std;

    int main()
    {
        setlocale(LC_ALL, "");
        wchar_t ws[] = { 0xe8, 0xe9, 0 };   // è, é, terminating null

        wcout << ws << endl;
    }

I tried it under Linux using the UTF-8 and ISO-8859-1 (Latin-1) charsets (export LANG=fr_FR.UTF-8 and export LANG=fr_FR.ISO-8859-1 respectively) and correctly got èé in the proper charset. I also tried it under Windows XP with code pages 850 (OEM) and 1252 (ANSI) (chcp 850 and chcp 1252 respectively, with the Lucida Console font), and got èé on the console too.

Edit 2 :

Of course, you can also set a global C++ locale with locale::global(locale("")); for the default locale or locale::global(locale("ru_RU.UTF-8")); for the Russian locale, but this does more than simply calling setlocale. According to the documentation of the GNU implementation of the C++ Standard Library about locale: "there is only one relation (of the C++ locale mechanism) to the C locale mechanism: the global C locale is modified if a named C++ locale object is set as the global locale", that is: std::locale::global(std::locale("")); affects the C functions as if the following call was made: std::setlocale(LC_ALL, "");. On the other hand, there is no vice versa, that is, calling setlocale has no effect whatsoever on the C++ locale mechanism, in particular on the working of locale("").
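This one-way relation can be observed directly. The following sketch uses two hypothetical helpers (the names are mine) and only relies on the "C" locale, which is always available.

```cpp
#include <clocale>
#include <locale>
#include <string>

// Hypothetical helper: changes the C locale, then reports the name of the
// global C++ locale, which is expected to be unaffected by setlocale.
std::string cxx_global_name_after_setlocale(const char* c_locale_name) {
    std::setlocale(LC_ALL, c_locale_name);
    return std::locale().name();
}

// Hypothetical helper: installs a named C++ locale as the global one, then
// queries the C locale, which is expected to follow it.
std::string c_locale_after_cxx_global(const std::string& cxx_locale_name) {
    std::locale::global(std::locale(cxx_locale_name.c_str()));
    return std::setlocale(LC_ALL, nullptr);
}
```

In a fresh program, cxx_global_name_after_setlocale("") should still report "C" even if the environment selects another locale, whereas c_locale_after_cxx_global changes what setlocale(LC_ALL, nullptr) returns.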

So it really looks like there is an underlying C library mechanism that must first be enabled with setlocale for the imbue conversion to work correctly.

Serge Ballesta answered Oct 17 '22 00:10