I come from Python, where you can use 'string[10]' to access a character in a sequence, and if the string is encoded in Unicode it gives the expected result. However, when I use indexing on a string in C++, it works as long as the characters are ASCII, but when the string contains a Unicode character, indexing outputs an octal representation like /201. For example:
string ramp = "ÐðŁłŠšÝýÞþŽž";
cout << ramp << "\n";
cout << ramp[5] << "\n";
Output:
ÐðŁłŠšÝýÞþŽž
/201
Why is this happening, and how can I access that character in its string representation, or convert the octal representation to the actual character?
Standard C++ is not equipped for proper handling of Unicode, giving you problems like the one you observed.
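To see where the 201 in the question comes from, it may help to look at the raw bytes. The following is a minimal sketch, assuming the string is stored as UTF-8 (the usual case with GCC or Clang on Linux). The bytes are spelled out with hex escapes so the result does not depend on the source file encoding. The names ramp_utf8 and byte_at are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The twelve characters "ÐðŁłŠšÝýÞþŽž" written out as UTF-8 bytes.
// Each of them needs two bytes, so the std::string holds 24 chars.
std::string ramp_utf8()
{
    return "\xC3\x90\xC3\xB0\xC5\x81\xC5\x82\xC5\xA0\xC5\xA1"
           "\xC3\x9D\xC3\xBD\xC3\x9E\xC3\xBE\xC5\xBD\xC5\xBE";
}

// Returns the byte at position i as an unsigned value (0..255).
// std::string::operator[] indexes bytes (code units), not characters.
unsigned byte_at( const std::string & s, std::size_t i )
{
    assert( i < s.size() );
    return static_cast<unsigned char>( s[i] );
}
```

Under that assumption, byte_at( ramp_utf8(), 5 ) is 0x81, which is 201 in octal. It is the second byte of Ł (U+0141, encoded as C5 81), so on its own it is not a printable character.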
The problem here is that C++ predates Unicode by a comfortable margin. This means that even that string literal of yours will be interpreted in an implementation-defined manner because those characters are not defined in the Basic Source Character set (which is, basically, the ASCII-7 characters minus @, $, and the backtick).
C++98 does not mention Unicode at all. It mentions wchar_t, and wstring being based on it, specifying wchar_t as being capable of "representing any character in the current locale". But that did more damage than good...
Microsoft defined wchar_t as 16 bit, which was enough for the Unicode code points at that time. However, since then Unicode has been extended beyond the 16-bit range... and Windows' 16-bit wchar_t is not "wide" anymore, because you need two of them to represent characters beyond the BMP -- and the Microsoft docs are notoriously ambiguous as to whether wchar_t means UTF-16 (multibyte encoding with surrogate pairs) or UCS-2 (wide encoding with no support for characters beyond the BMP).
All the while, a Linux wchar_t is 32 bit, which is wide enough for UTF-32...
C++11 made significant improvements to the subject, adding char16_t and char32_t, including their associated string variants, to remove the ambiguity, but still it is not fully equipped for Unicode operations.
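The difference between the two C++11 types shows up as soon as a character lies outside the BMP. The sketch below uses universal character names (\u and \U escapes) so it does not depend on the source encoding. The function names are illustrative only.

```cpp
#include <cassert>
#include <string>

// U+1F600 lies outside the BMP, so it needs two UTF-16 code units
// (a surrogate pair) but only one UTF-32 code unit.
std::u16string smiley_utf16() { return u"\U0001F600"; }
std::u32string smiley_utf32() { return U"\U0001F600"; }

// U+0161 (š) is inside the BMP: one code unit in either encoding.
std::u16string s_caron_utf16() { return u"\u0161"; }
std::u32string s_caron_utf32() { return U"\u0161"; }

// size() counts code units, so it only equals the number of
// code points for the UTF-32 strings.
void check_lengths()
{
    assert( smiley_utf16().size() == 2 );
    assert( smiley_utf32().size() == 1 );
    assert( s_caron_utf16().size() == 1 );
    assert( s_caron_utf32().size() == 1 );
}
```

Indexing a u16string can therefore still return half of a surrogate pair, whereas indexing a u32string always returns a whole code point.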
Just as one example, try to convert e.g. German "Fuß" to uppercase and you will see what I mean. (The single letter 'ß' would need to expand to 'SS', which the standard functions -- handling one character in, one character out at a time -- cannot do.)
However, there is help. The International Components for Unicode (ICU) library is fully equipped to handle Unicode in C++. As for specifying special characters in source code, you will have to use u8"", u"", and U"" to enforce interpretation of the string literal as UTF-8, UTF-16, and UTF-32 respectively, using octal / hexadecimal escapes or relying on your compiler implementation to handle non-ASCII-7 encodings appropriately.
And even then you will get an integer value for std::cout << ramp[5], because for C++, a character is just an integer with semantic meaning. ICU's ustream.h provides operator<< overloads for the icu::UnicodeString class, but ramp[5] is just a 16-bit unsigned integer (1), and people would look askance at you if their unsigned short would suddenly be interpreted as characters. You need the C-API u_fputs() / u_printf() / u_fprintf() functions for that.
#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <unicode/ustdio.h>
#include <iostream>
int main()
{
    // make sure your source file is UTF-8 encoded...
    icu::UnicodeString ramp( icu::UnicodeString::fromUTF8( "ÐðŁłŠšÝýÞþŽž" ) );
    std::cout << ramp << "\n";
    std::cout << ramp[5] << "\n";
    u_printf( "%C\n", ramp[5] );
}
Compiled with g++ -std=c++11 testme.cpp -licuio -licuuc.
ÐðŁłŠšÝýÞþŽž
353
š
(1) ICU uses UTF-16 internally, and UnicodeString::operator[] returns a code unit, not a code point, so you might end up with one half of a surrogate pair. Look up the API docs for the various other ways to index a Unicode string.
C++ has no useful native Unicode support. You almost certainly will need an external library like ICU.
To access code points individually, use u32string, which represents a string as a sequence of UTF-32 code units of type char32_t.
u32string ramp = U"ÐðŁłŠšÝýÞþŽž";
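// Note: the standard streams have no operator<< for u32string, so
// "cout << ramp" does not compile; print the code point's value instead.
cout << static_cast<unsigned long>(ramp[5]) << "\n"; // 353, i.e. U+0161 (š)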
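Since the standard streams cannot print a u32string or a char32_t as text, one way to combine code point indexing with ordinary output is to convert between UTF-8 and UTF-32 by hand. The following is a minimal sketch under stated assumptions: the input is well-formed UTF-8, no validation is done, and decode_utf8 and encode_utf8 are hypothetical helper names, not standard library functions. For production use, a library such as ICU is the safer choice.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes UTF-8 into UTF-32. Assumes well-formed input; does not
// reject overlong forms, stray continuation bytes, or surrogates.
std::u32string decode_utf8( const std::string & in )
{
    std::u32string out;
    for ( std::size_t i = 0; i < in.size(); )
    {
        unsigned char lead = static_cast<unsigned char>( in[i] );
        // Number of continuation bytes, derived from the lead byte.
        int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Payload bits of the lead byte (all 7 bits for ASCII).
        char32_t cp = extra == 0 ? lead : lead & ( 0x3F >> extra );
        // Each continuation byte contributes its low 6 bits.
        for ( int k = 1; k <= extra && i + k < in.size(); ++k )
            cp = ( cp << 6 ) | ( static_cast<unsigned char>( in[i + k] ) & 0x3F );
        out.push_back( cp );
        i += extra + 1;
    }
    return out;
}

// Encodes a single code point as UTF-8.
std::string encode_utf8( char32_t cp )
{
    assert( cp <= 0x10FFFF );
    std::string out;
    if ( cp < 0x80 )
    {
        out += static_cast<char>( cp );
    }
    else if ( cp < 0x800 )
    {
        out += static_cast<char>( 0xC0 | ( cp >> 6 ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    }
    else if ( cp < 0x10000 )
    {
        out += static_cast<char>( 0xE0 | ( cp >> 12 ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    }
    else
    {
        out += static_cast<char>( 0xF0 | ( cp >> 18 ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 12 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    }
    return out;
}
```

With the UTF-8 bytes of the example string as input, element 5 of the decoded string is U+0161 (š), and std::cout << encode_utf8( decoded[5] ) writes the two bytes C5 A1, which a UTF-8 terminal displays as š.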