
C++ tolower on special characters such as ü

I have trouble converting a string to lowercase with the tolower() function in C++. With plain ASCII strings it works as expected, but special characters are not converted correctly.

How I use my function:

string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
    LowerCase += tolower(NotLowerCase[i]);
}

For example:

  1. Test -> test
  2. TeST2 -> test2
  3. Grüßen -> gr????en
  4. (§) -> ()

As you can see, examples 3 and 4 do not work as expected.

How can I fix this issue? I have to keep the special characters, but in lowercase.

TVA van Hesteren asked Oct 18 '22 15:10


1 Answer

The sample code below, taken from the reference documentation for std::tolower, shows how to fix this: you have to use something other than the default "C" locale.

#include <iostream>
#include <cctype>
#include <clocale>

int main()
{
    unsigned char c = '\xb4'; // the character Ž in ISO-8859-15
                              // but ´ (acute accent) in ISO-8859-1 

    std::setlocale(LC_ALL, "en_US.iso88591");
    std::cout << std::hex << std::showbase;
    std::cout << "in iso8859-1, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
    std::setlocale(LC_ALL, "en_US.iso885915");
    std::cout << "in iso8859-15, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
}
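
Applied to the loop from the question, the same idea might look like the sketch below. The helper name to_lower_bytes is hypothetical, and the sketch assumes the string is stored in a single-byte encoding (such as ISO-8859-1) whose locale has already been selected with std::setlocale. It would not help for UTF-8 input, where ü occupies two bytes and std::tolower only ever sees one byte at a time.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: lowercases each byte according to the current C locale.
// Assumes a single-byte encoding; multi-byte encodings such as UTF-8 are not handled.
std::string to_lower_bytes(const std::string& input)
{
    std::string result;
    result.reserve(input.size());
    for (char ch : input) {
        // Cast to unsigned char first: passing a negative char value
        // to std::tolower is undefined behaviour.
        result += static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    }
    return result;
}
```

A caller would typically run std::setlocale(LC_ALL, "en_US.iso88591") (or whichever locale name is installed on the system) once at startup, then call to_lower_bytes(NotLowerCase).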

You might also change std::string to std::wstring, whose wchar_t elements hold Unicode code units on many C++ implementations (UTF-16 on Windows, UTF-32 on most Unix-like systems).

wstring NotLowerCase = L"Grüßen";
wstring LowerCase;
for (auto&& ch : NotLowerCase) {
    LowerCase += towlower(ch);
}
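
Wrapped into a reusable function, the wide-character loop could be sketched as follows. The name to_lower_wide is hypothetical. Note that std::towlower also consults the current C locale, so in the default "C" locale characters outside ASCII may be left unchanged; the sketch assumes the caller has already selected a suitable locale.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: lowercases a wide string one wchar_t at a time.
// The result for non-ASCII characters depends on the current C locale.
std::wstring to_lower_wide(const std::wstring& input)
{
    std::wstring result;
    result.reserve(input.size());
    for (wchar_t ch : input) {
        result += static_cast<wchar_t>(std::towlower(static_cast<std::wint_t>(ch)));
    }
    return result;
}
```

Calling std::setlocale(LC_ALL, "") from <clocale> before the first use selects the user's environment locale, which is usually what makes Ü map to ü.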

Guidance from Microsoft is to "Normalize strings to uppercase", so you might use toupper or towupper instead.

Keep in mind that a character-by-character transformation might not work well for some languages. For example, in German as written in Germany, making Grüßen all upper-case turns it into GRÜSSEN, because ß becomes SS (although there is now a capital ẞ). There are numerous other "problems", such as combining characters; if you're doing real "production" work with strings, you really want a completely different approach.
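
The limitation is easy to see in code. In the sketch below (to_upper_wide is a hypothetical name), each input character produces exactly one output character, so the result can never be longer than the input and ß has no way of expanding to SS, whatever locale is active.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: uppercases a wide string one wchar_t at a time.
// Because the mapping is one-to-one, the output length always equals the input length.
std::wstring to_upper_wide(const std::wstring& input)
{
    std::wstring result;
    result.reserve(input.size());
    for (wchar_t ch : input) {
        result += static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(ch)));
    }
    return result;
}
```

Passing L"Gr\u00fc\u00dfen" (Grüßen) therefore yields six characters again, not the seven of GRÜSSEN.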

Finally, C++ has more sophisticated support for managing locales; see <locale> for details.
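
As a rough illustration of the <locale> route, the sketch below lowercases a wide string through the std::ctype<wchar_t> facet of an explicit std::locale object instead of relying on the global C locale. The function name to_lower_with_locale is hypothetical, and which characters actually change depends on the locale passed in.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: lowercases a wide string using the ctype facet of `loc`.
std::wstring to_lower_with_locale(const std::wstring& input, const std::locale& loc)
{
    const auto& facet = std::use_facet<std::ctype<wchar_t>>(loc);
    std::wstring result = input;
    if (!result.empty()) {
        // ctype::tolower(first, last) converts the whole range in place.
        facet.tolower(&result[0], &result[0] + result.size());
    }
    return result;
}
```

A caller might pass std::locale("") for the user's environment, or a named locale such as std::locale("de_DE.UTF-8"), assuming it is installed; the std::locale constructor throws std::runtime_error for an unknown name. This is still a one-character-to-one-character mapping, so the caveats about ß and combining characters apply here as well.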

Ðаn answered Nov 10 '22 07:11