
How to change case of latin UTF-8 strings in C++?

Tags: c++, utf-8, stl

In Objective-C, it's dead simple:

NSLog(@"%@", [@"BAÑO" lowercaseString]);  // Outputs "baño".

In C++, what's the equivalent? Can anyone provide valid code for this that produces the same output? Is there a nice STL way to do this without relying on ICU, Boost, or any other 3rd party libs?

My current non-solution is:

using namespace std;
string s = "BAÑO";
wstring w(s.begin(), s.end());
transform(w.begin(), w.end(), w.begin(), towlower);
// w contains "baÑo"
drhr asked May 17 '12 23:05



2 Answers

This problem turns out to be surprisingly complicated in C++. I know of only one library that gets it right, taking into account Unicode normalization and the other issues that arise for code points outside the 7-bit ASCII range:

IBM's ICU

It's massive, but it does everything correctly. Unfortunately, toupper and tolower fall short here, and there's no other C++ construct available.

cppguy answered Nov 03 '22 11:11



There is tolower, which is locale-specific, but I don't think it will work with UTF-8 strings: it converts one char at a time, while a character such as Ñ occupies two bytes in UTF-8.
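To make the byte-versus-character mismatch concrete, here is a minimal sketch. The helper name count_code_points is hypothetical (it is not a standard library function), and the string is spelled with explicit UTF-8 byte escapes so the example does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes, which always have the bit pattern 10xxxxxx.
// Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte: starts a new code point
        }
    }
    return n;
}
```

For "BA\xC3\x91O" (the UTF-8 bytes of BAÑO), std::string::size() reports 5 bytes while count_code_points reports 4 characters. Constructing a wstring from s.begin() and s.end() copies those 5 bytes one by one, so the Ñ ends up split across two elements that towlower does not recognize as a letter.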

The correct solution will always be locale-specific, because case rules depend on the language. For example, the lowercase version of 'I' is not always 'i': in Turkish it is the dotless 'ı'.
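If the input is known to contain only ASCII and Latin-1 Supplement letters (U+00C0 to U+00FF), a hand-rolled conversion using only the standard library is possible. The following is a rough sketch under that assumption, not a general solution: the function name to_lower_latin1_utf8 is made up for illustration, it ignores locale-specific rules such as the Turkish 'I', and it leaves every other code point untouched.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: lowercases A-Z and the uppercase letters of the
// Latin-1 Supplement block (U+00C0..U+00DE, except U+00D7 '×') in a UTF-8
// string. All other bytes are copied through unchanged.
std::string to_lower_latin1_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 >= 'A' && b0 <= 'Z') {
            out += static_cast<char>(b0 + 0x20);  // plain ASCII letter
            continue;
        }
        // The lead byte 0xC3 introduces the two-byte sequences for U+00C0..U+00FF.
        if (b0 == 0xC3 && i + 1 < in.size()) {
            unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            if ((b1 & 0xC0) == 0x80) {  // valid continuation byte
                // 0x80..0x9E encode U+00C0..U+00DE; 0x97 is U+00D7 (multiplication sign).
                if (b1 <= 0x9E && b1 != 0x97) {
                    b1 += 0x20;  // the lowercase form sits 0x20 code points higher
                }
                out += static_cast<char>(b0);
                out += static_cast<char>(b1);
                ++i;  // consumed the continuation byte
                continue;
            }
        }
        out += static_cast<char>(b0);  // anything else passes through as-is
    }
    return out;
}
```

Called with "BA\xC3\x91O" (BAÑO in UTF-8), this returns "ba\xC3\xB1o" (baño). Anything beyond this narrow range, such as normalization, other scripts, or language-dependent rules, still calls for a full Unicode library.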

Adrian McCarthy answered Nov 03 '22 10:11
