
Unicode string indexing in C++

I come from Python, where you can use 'string[10]' to access a character in a sequence, and if the string is Unicode it gives the expected result. In C++, indexing a string works as long as the characters are ASCII, but when the string contains non-ASCII Unicode characters and I index into it, the output is an octal representation like /201. For example:

string ramp = "ÐðŁłŠšÝýÞþŽž";
cout << ramp << "\n";    
cout << ramp[5] << "\n";

Output:

ÐðŁłŠšÝýÞþŽž
/201

Why is this happening, and how can I access that character in its string representation, or convert the octal representation to the actual character?

Bahman Eslami asked Jul 17 '15 11:07


3 Answers

Standard C++ is not equipped for proper handling of Unicode, giving you problems like the one you observed.

The problem here is that C++ predates Unicode by a comfortable margin. This means that even that string literal of yours will be interpreted in an implementation-defined manner because those characters are not defined in the Basic Source Character set (which is, basically, the ASCII-7 characters minus @, $, and the backtick).

C++98 does not mention Unicode at all. It mentions wchar_t, and wstring being based on it, specifying wchar_t as being capable of representing "distinct codes for all members of the largest extended character set specified among the supported locales". But that did more harm than good...

Microsoft defined wchar_t as 16 bits, which was enough for the Unicode code points at that time. However, since then Unicode has been extended beyond the 16-bit range... and Windows' 16-bit wchar_t is not "wide" anymore, because you need two of them to represent characters beyond the BMP -- and the Microsoft docs are notoriously ambiguous as to whether wchar_t means UTF-16 (multibyte encoding with surrogate pairs) or UCS-2 (wide encoding with no support for characters beyond the BMP).

All the while, a Linux wchar_t is 32 bit, which is wide enough for UTF-32...

C++11 made significant improvements to the subject, adding char16_t and char32_t including their associated string variants to remove the ambiguity, but still it is not fully equipped for Unicode operations.

Just as one example, try to convert e.g. German "Fuß" to uppercase and you will see what I mean. (The single letter 'ß' would need to expand to 'SS', which the standard functions -- handling one character in, one character out at a time -- cannot do.)

However, there is help. The International Components for Unicode (ICU) library is fully equipped to handle Unicode in C++. As for specifying special characters in source code, you will have to use u8"", u"", and U"" to enforce interpretation of the string literal as UTF-8, UTF-16, and UTF-32 respectively, using octal / hexadecimal escapes or relying on your compiler implementation to handle non-ASCII-7 encodings appropriately.

And even then you will get an integer value for std::cout << ramp[5], because for C++, a character is just an integer with semantic meaning. ICU's ustream.h provides operator<< overloads for the icu::UnicodeString class, but ramp[5] is just a 16-bit unsigned integer (1), and people would look askance at you if their unsigned short would suddenly be interpreted as characters. You need the C-API u_fputs() / u_printf() / u_fprintf() functions for that.

#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <unicode/ustdio.h>

#include <iostream>

int main()
{
    // make sure your source file is UTF-8 encoded...
    icu::UnicodeString ramp( icu::UnicodeString::fromUTF8( "ÐðŁłŠšÝýÞþŽž" ) );
    std::cout << ramp << "\n";
    std::cout << ramp[5] << "\n";
    u_printf( "%C\n", ramp[5] );
}

Compiled with g++ -std=c++11 testme.cpp -licuio -licuuc.

ÐðŁłŠšÝýÞþŽž
353
š

(1) ICU uses UTF-16 internally, and UnicodeString::operator[] returns a code unit, not a code point, so you might end up with one half of a surrogate pair. Look up the API docs for the various other ways to index a Unicode string, such as char32At() and moveIndex32().

DevSolar answered Sep 19 '22 12:09


C++ has no useful native Unicode support. You almost certainly will need an external library like ICU.

Puppy answered Sep 19 '22 12:09


To access codepoints individually, use u32string, which represents a string as a sequence of UTF-32 code units of type char32_t.

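// needs <codecvt>, <locale>, <string>, <iostream>; wstring_convert is deprecated since C++17
u32string ramp = U"ÐðŁłŠšÝýÞþŽž";
// cout has no operator<< for u32string or char32_t, so convert to UTF-8 for output
wstring_convert<codecvt_utf8<char32_t>, char32_t> conv;
cout << conv.to_bytes(ramp) << "\n";
cout << conv.to_bytes(ramp[5]) << "\n";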

ecatmur answered Sep 19 '22 12:09