
Storing unicode UTF-8 string in std::string

In response to the discussions in "Cross-platform strings (and Unicode) in C++" and "How to deal with Unicode strings in C/C++ in a cross-platform friendly way?", I'm trying to assign a UTF-8 string to a std::string variable in Visual Studio 2010:

std::string msg = "महसुस";

However, when I view the string in the debugger, I only see "?????". I have the file saved as Unicode (UTF-8 with signature) and I'm using the "Use Unicode Character Set" project setting.

"महसुस" is a Nepali word; it contains 5 characters and should occupy 15 bytes in UTF-8, but the Visual Studio debugger shows the size of msg as 5.

My question is:

How do I use std::string to just store the UTF-8 bytes, without needing to manipulate them?

Pritesh Acharya asked Apr 24 '14 09:04



2 Answers

If you were using C++11 then this would be easy:

std::string msg = u8"महसुस";

But since you are not, you can use escape sequences instead of relying on the source file's charset to manage the encoding for you. This also makes your code more portable, in case the file is accidentally saved in a non-UTF-8 format:

std::string msg = "\xE0\xA4\xAE\xE0\xA4\xB9\xE0\xA4\xB8\xE0\xA5\x81\xE0\xA4\xB8"; // "महसुस"
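To check that the escaped literal really holds the 15 bytes and 5 characters mentioned in the question, something like the following sketch could be used. The helper name countCodePoints is hypothetical (not from the original answer), and it assumes the input is already valid UTF-8; it simply skips continuation bytes and does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Assumption: the input is valid UTF-8. Continuation bytes have the
// bit pattern 10xxxxxx, so every other byte starts a new code point.
std::size_t countCodePoints(const std::string &utf8)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i)
    {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

With the escaped literal above, msg.size() reports the number of bytes (15), while countCodePoints(msg) reports the number of code points (5). std::string itself only knows about bytes.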

Otherwise, you might consider doing a conversion at runtime instead:

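#include <windows.h>
#include <string>

// Converts a UTF-16 wide string to a UTF-8 encoded std::string (Windows only).
std::string toUtf8(const std::wstring &str)
{
    std::string ret;
    // WideCharToMultiByte takes the length as an int, so cast explicitly.
    const int srcLen = static_cast<int>(str.length());
    // First call: ask how many bytes the UTF-8 output needs.
    int len = WideCharToMultiByte(CP_UTF8, 0, str.c_str(), srcLen, NULL, 0, NULL, NULL);
    if (len > 0)
    {
        ret.resize(len);
        // Second call: write the UTF-8 bytes into the buffer.
        WideCharToMultiByte(CP_UTF8, 0, str.c_str(), srcLen, &ret[0], len, NULL, NULL);
    }
    return ret;
}

std::string msg = toUtf8(L"महसुस");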
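The toUtf8 function above depends on the Win32 API. If a portable runtime conversion is wanted, one possible sketch is to encode code points by hand. The function name encodeUtf8 is hypothetical, and the code assumes a C++11 compiler (for std::u32string and range-based for), which Visual Studio 2010 does not fully provide. Surrogates and values above U+10FFFF are silently skipped here, which is a simplification rather than a recommendation.

```cpp
#include <string>

// Hypothetical helper: encodes Unicode code points as UTF-8 bytes.
// Assumption: C++11 or later. Invalid code points are skipped.
std::string encodeUtf8(const std::u32string &codePoints)
{
    std::string out;
    for (char32_t cp : codePoints)
    {
        if (cp < 0x80)
        {
            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            if (cp >= 0xD800 && cp <= 0xDFFF)
                continue; // surrogates are not valid code points in UTF-8
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp <= 0x10FFFF)
        {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Called as encodeUtf8(U"\u092E\u0939\u0938\u0941\u0938"), this should produce the same 15 bytes as the escaped literal shown earlier, without depending on how the source file was saved.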
Remy Lebeau answered Sep 29 '22 21:09


In the Watch window, enter msg.c_str(),s8 to have the debugger display the string as UTF-8.

Sergey K. answered Sep 29 '22 21:09