
Strange behavior of std::string with unicode

I have the following piece of code:

#include <iostream>
#include <string>

std::string eps("ε");

int main()
{
    std::cout << eps << '\n';
    return 0;
}

Somehow it compiles with both g++ and clang on Ubuntu, and it even prints the right character, ε. I also have almost the same piece of code that happily reads ε from cin into a std::string. By the way, eps.size() is 2.

My question is: how does this work? How can a Unicode character be stored in a std::string? My guess is that the operating system handles all the Unicode work, but I'm not sure.

EDIT

As for output, I now understand that it is the terminal that is responsible for showing me the right character (ε in this case).

But what about input? cin reads characters up to ' ' or any other whitespace character (and, as I understand it, byte by byte). So if I take Ƞ, whose second byte is 32 (' '), it should read only the first byte and then stop. But it reads the whole Ƞ. How?

justanothercoder asked Dec 13 '14 19:12




1 Answer

The most likely explanation is that everything is encoded in UTF-8, as it is on my system:

$ xxd test.cpp
...
0000020: 2065 7073 2822 ceb5 2229 3b0a 0a69 6e74   eps("..");..int
                        ^^^^ ε in UTF-8                 ^^ TWO bytes!
...
$ g++ -o test.out test.cpp
$ ./test.out 
ε
$ ./test.out | xxd
0000000: ceb5 0a
         ^^^^              
NPE answered Sep 30 '22 16:09
