
c++ towupper() doesn't convert certain characters

I use Borland C++ Builder 2009 and my application is translated into several languages, including Polish.

For a small piece of functionality I use towupper() to capitalize a string, to put emphasis on it when the user ignores it the first time.

The original string is loaded from a language dll, into a utf16 wstring object and I convert like this:

int length = mystring.length() ;
for (int x = 0 ; x < length ; x++)
    {
    mystring[x] = towupper(mystring[x]);
    }

All this works well, except for Polish, where the following sentence: "Rozumiem ryzykowność wykonania tej operacji" converts to "ROZUMIEM RYZYKOWNOść WYKONANIA TEJ OPERACJI" instead of "ROZUMIEM RYZYKOWNOŚĆ WYKONANIA TEJ OPERACJI"

(notice that the two last characters of the word "ryzykowność" do not convert).

It's not as if there are no capitalized Unicode variants of this character available. Unicode character 346 does the trick. http://www.fileformat.info/info/unicode/char/015a/index.htm

Is this a matter of an outdated library in my outdated compiler installation, or am I missing something else?

Peter asked Jan 08 '17 22:01


2 Answers

Implementations of towupper are not required by the C++ standard to perform Unicode case conversions, even if wide strings are Unicode strings, and even in cases where one lower-case codepoint maps to exactly one upper-case codepoint.

Furthermore, towupper is incapable of performing proper Unicode case conversion, even if the implementation supported it. Case conversion can change the number of codepoints in a Unicode character sequence, and towupper, which maps one character to exactly one character, cannot do that.
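To make the length limitation concrete, here is a minimal sketch (assuming a C++11 compiler; `upper_per_element` is a hypothetical name, not a library function). It applies towupper to each element the way the question's loop does, so the result always has the same number of elements as the input, whatever the locale.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: uppercase each wchar_t independently, like the
// loop in the question. One element in, one element out, so the length
// of the string can never change.
std::wstring upper_per_element(std::wstring text) {
    for (wchar_t& ch : text) {
        ch = static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(ch)));
    }
    return text;
}
```

For example, `upper_per_element(L"stra\u00DFe")` still has 6 elements, whereas the full Unicode uppercase form of that German word, "STRASSE", has 7. What the ß element itself becomes depends on the implementation and locale.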

You cannot rely on the C++ standard library for dealing with Unicode matters of this sort. You'll need to move to a dedicated Unicode library like ICU.
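If pulling in a Unicode library is not practical for a single emphasised string, one stopgap is an explicit table for the alphabet you know you need. This is not the ICU approach recommended above, only a sketch under the assumption that the text is Polish and that every letter involved has a one-to-one uppercase form. The names `polish_upper_char` and `polish_upper` are hypothetical.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: uppercase one character, handling the nine Polish
// lowercase letters with diacritics explicitly and deferring everything
// else to towupper.
wchar_t polish_upper_char(wchar_t ch) {
    switch (ch) {
        case L'\u0105': return L'\u0104'; // ą -> Ą
        case L'\u0107': return L'\u0106'; // ć -> Ć
        case L'\u0119': return L'\u0118'; // ę -> Ę
        case L'\u0142': return L'\u0141'; // ł -> Ł
        case L'\u0144': return L'\u0143'; // ń -> Ń
        case L'\u00F3': return L'\u00D3'; // ó -> Ó
        case L'\u015B': return L'\u015A'; // ś -> Ś
        case L'\u017A': return L'\u0179'; // ź -> Ź
        case L'\u017C': return L'\u017B'; // ż -> Ż
        default:
            return static_cast<wchar_t>(
                std::towupper(static_cast<std::wint_t>(ch)));
    }
}

// Apply the per-character mapping to a whole string.
std::wstring polish_upper(std::wstring text) {
    for (wchar_t& ch : text) {
        ch = polish_upper_char(ch);
    }
    return text;
}
```

With this, `polish_upper(L"ryzykowno\u015B\u0107")` yields "RYZYKOWNOŚĆ". The table does nothing for other languages, which is exactly the gap a library such as ICU fills.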

Nicol Bolas answered Sep 29 '22 19:09


EDIT: I just realised you're using Borland, not MSVC. On Windows with MSVC, this will work:

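 #include <cctype>   // _towupper_l (Microsoft-specific)
 #include <clocale>  // setlocale, _get_current_locale, _free_locale

 int main(int argc, char** argv)
 {
    setlocale(LC_ALL, "polish");

    // Fetch the locale object once and release it when done.
    _locale_t loc = _get_current_locale();

    wchar_t c[2] = { L'ś', L'ć'};
    wchar_t c1 = _towupper_l(c[0], loc);
    wchar_t c2 = _towupper_l(c[1], loc);

    _free_locale(loc);
    return 0;
 }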

You first need to set the locale to 'polish' with setlocale, and then use _towupper_l. Here's a link that tells you what strings, referring to a specific language, can be used with setlocale.
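The same idea can be sketched with standard calls only, for compilers without the Microsoft-specific `_towupper_l`. This assumes a C++11 compiler; the helper names are hypothetical. Locale names are platform-specific ("polish" on MSVC, typically something like "pl_PL.UTF-8" on POSIX systems), and whether a given runtime's towupper then covers ś and ć is not guaranteed, so the return value of setlocale is worth checking.

```cpp
#include <clocale>
#include <cwctype>
#include <string>

// Hypothetical helper: install the named locale for character handling.
// setlocale returns a null pointer when the name is not accepted.
bool use_ctype_locale(const char* name) {
    return std::setlocale(LC_CTYPE, name) != nullptr;
}

// Hypothetical helper: uppercase with whatever locale is currently set.
std::wstring upper_in_current_locale(std::wstring text) {
    for (wchar_t& ch : text) {
        ch = static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(ch)));
    }
    return text;
}
```

A caller might try `use_ctype_locale("polish")`, fall back to another name if that returns false, and only then call `upper_in_current_locale`.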

EDIT: Note that if I print the results:

_wprintf_l(L" c1 = %c, c2 = %c\n", _get_current_locale(),  c1, c2);

The output will be:

c1 = S, c2 = C

But if I watch the values of c1 and c2 in my debugger, I can see the correct results, with the accents. My console just cannot print those characters.
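When the console cannot show the accented glyphs, one way to confirm the conversion without a debugger is to print the numeric values instead. A minimal sketch, assuming a C++11 compiler; `code_points` is a hypothetical helper, and on Windows, where wchar_t is 16 bits, it reports UTF-16 code units.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each element of a wide string as U+XXXX,
// using only ASCII so that any console can display the result.
std::string code_points(const std::wstring& text) {
    std::string out;
    char buf[16];
    for (wchar_t ch : text) {
        std::snprintf(buf, sizeof buf, "U+%04lX",
                      static_cast<unsigned long>(ch));
        if (!out.empty()) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

If c1 and c2 were converted correctly, `code_points(std::wstring{c1, c2})` gives "U+015A U+0106" (Ś and Ć); unconverted values would show up as "U+015B U+0107".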

nikau6 answered Sep 29 '22 20:09