
UTF8 to/from wide char conversion in STL

Is it possible to convert a UTF-8 string held in a std::string to a std::wstring and vice versa in a platform-independent manner? In a Windows application I would use MultiByteToWideChar and WideCharToMultiByte. However, the code is compiled for multiple OSes and I'm limited to the standard C++ library.

asked Sep 29 '08 by Vladimir Grigorov

People also ask

Does std::string support UTF-8?

UTF-8 actually works quite well in std::string. Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.

What encoding is wchar_t?

The wchar_t type is an implementation-defined wide character type. In the Microsoft compiler, it represents a 16-bit wide character used to store Unicode encoded as UTF-16LE, the native character type on Windows operating systems.

Is wchar_t UTF-16?

On Windows, wchar_t is UTF-16, so there the conversion function can just do a memcpy :-) On everything else, the conversion is algorithmic, and pretty simple.

Is UTF-16 better than UTF-8?

UTF-16 can be more compact where ASCII is not predominant, since it uses 2 bytes for most characters. UTF-8 starts to use 3 or more bytes for higher code points, where UTF-16 still needs just 2 bytes for most characters.


2 Answers

I asked this question 5 years ago. This thread was very helpful for me back then; I came to a conclusion, then moved on with my project. It is funny that I recently needed something similar, totally unrelated to that old project. As I was researching possible solutions, I stumbled upon my own question :)

The solution I chose now is based on C++11. The boost libraries that Constantin mentions in his answer are now part of the standard. If we replace std::wstring with the new string type std::u16string, then the conversions will look like this:

UTF-8 to UTF-16

    std::string source;
    ...
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    std::u16string dest = convert.from_bytes(source);

UTF-16 to UTF-8

    std::u16string source;
    ...
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    std::string dest = convert.to_bytes(source);
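
For context, here is a small self-contained sketch of the round trip using the same facilities; the test string is my own illustration, and note that std::wstring_convert and std::codecvt_utf8_utf16 still work but were deprecated in C++17:

    #include <codecvt>
    #include <locale>
    #include <string>
    #include <iostream>

    int main()
    {
        // UTF-8 encoded input; the Greek letters are just an illustrative choice
        std::string utf8 = "\xCE\xB3\xCE\xB4";            // "γδ"

        // One converter object handles both directions
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;

        std::u16string utf16 = convert.from_bytes(utf8);  // UTF-8  -> UTF-16
        std::string back = convert.to_bytes(utf16);       // UTF-16 -> UTF-8

        std::cout << (back == utf8 ? "round trip OK" : "round trip failed") << '\n';
    }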

As seen from the other answers, there are multiple approaches to the problem. That's why I refrain from picking an accepted answer.

answered Oct 10 '22 by Vladimir Grigorov


The problem definition explicitly states that the 8-bit character encoding is UTF-8. That makes this a trivial problem; all it requires is a little bit-twiddling to convert from one UTF spec to another.

Just look at the encodings on these Wikipedia pages for UTF-8, UTF-16, and UTF-32.

The principle is simple - go through the input and assemble a 32-bit Unicode code point according to one UTF spec, then emit the code point according to the other spec. The individual code points need no translation, as would be required with any other character encoding; that's what makes this a simple problem.

Here's a quick implementation of wchar_t to UTF-8 conversion and vice versa. It assumes that the input is already properly encoded - the old saying "Garbage in, garbage out" applies here. I believe that verifying the encoding is best done as a separate step.

    // Convert a wide (UTF-16 or UTF-32) string to a UTF-8 encoded std::string.
    std::string wchar_to_UTF8(const wchar_t * in)
    {
        std::string out;
        unsigned int codepoint = 0;
        for (; *in != 0; ++in)
        {
            if (*in >= 0xd800 && *in <= 0xdbff)
                // High surrogate: remember the upper bits and wait for the low surrogate.
                codepoint = ((*in - 0xd800) << 10) + 0x10000;
            else
            {
                if (*in >= 0xdc00 && *in <= 0xdfff)
                    // Low surrogate: combine with the pending high surrogate.
                    codepoint |= *in - 0xdc00;
                else
                    codepoint = *in;

                // Emit the code point as 1 to 4 UTF-8 bytes.
                if (codepoint <= 0x7f)
                    out.append(1, static_cast<char>(codepoint));
                else if (codepoint <= 0x7ff)
                {
                    out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
                    out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
                }
                else if (codepoint <= 0xffff)
                {
                    out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
                    out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                    out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
                }
                else
                {
                    out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
                    out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
                    out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                    out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
                }
                codepoint = 0;
            }
        }
        return out;
    }

The above code works for both UTF-16 and UTF-32 input, simply because code points in the range 0xD800 through 0xDFFF are invalid on their own; encountering them means you're decoding UTF-16 surrogate pairs. If you know that wchar_t is 32 bits, you could remove some code to optimize the function.

    // Convert a UTF-8 encoded string to a wide (UTF-16 or UTF-32) std::wstring.
    std::wstring UTF8_to_wchar(const char * in)
    {
        std::wstring out;
        unsigned int codepoint = 0;
        while (*in != 0)
        {
            unsigned char ch = static_cast<unsigned char>(*in);
            if (ch <= 0x7f)
                codepoint = ch;                              // single byte (ASCII)
            else if (ch <= 0xbf)
                codepoint = (codepoint << 6) | (ch & 0x3f);  // continuation byte
            else if (ch <= 0xdf)
                codepoint = ch & 0x1f;                       // start of 2-byte sequence
            else if (ch <= 0xef)
                codepoint = ch & 0x0f;                       // start of 3-byte sequence
            else
                codepoint = ch & 0x07;                       // start of 4-byte sequence
            ++in;
            // The code point is complete when the next byte is not a continuation byte.
            if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
            {
                if (sizeof(wchar_t) > 2)
                    // 32-bit wchar_t: store the code point directly.
                    out.append(1, static_cast<wchar_t>(codepoint));
                else if (codepoint > 0xffff)
                {
                    // 16-bit wchar_t: encode as a UTF-16 surrogate pair.
                    codepoint -= 0x10000;
                    out.append(1, static_cast<wchar_t>(0xd800 + (codepoint >> 10)));
                    out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
                }
                else if (codepoint < 0xd800 || codepoint >= 0xe000)
                    out.append(1, static_cast<wchar_t>(codepoint));
            }
        }
        return out;
    }
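
To show how the two functions fit together, here is a minimal round-trip check; the test string and output wording are my own example, not part of the original answer:

    #include <iostream>
    #include <string>

    // wchar_to_UTF8 and UTF8_to_wchar are the functions defined above.

    int main()
    {
        const wchar_t * wide = L"caf\u00e9";                  // "café", U+00E9
        std::string utf8 = wchar_to_UTF8(wide);               // wide -> UTF-8 bytes
        std::wstring round = UTF8_to_wchar(utf8.c_str());     // UTF-8 -> wide again

        std::cout << "UTF-8 size in bytes: " << utf8.size()   // 5: c, a, f + 2 bytes for é
                  << (round == wide ? ", round trip OK" : ", round trip failed") << '\n';
    }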

Again, if you know that wchar_t is 32 bits you could remove some code from this function, but in this case it shouldn't make any difference. The expression sizeof(wchar_t) > 2 is known at compile time, so any decent compiler will recognize the dead code and remove it.
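
If C++17 is available, that compile-time choice can also be spelled out explicitly with if constexpr. The helper below is only a sketch of that idea under the same assumptions as the function above, not part of the original answer:

    #include <string>

    // Sketch: append one Unicode code point to a std::wstring, selecting the
    // UTF-32 or UTF-16 path at compile time (requires C++17 for if constexpr).
    inline void append_codepoint(std::wstring & out, unsigned int codepoint)
    {
        if constexpr (sizeof(wchar_t) > 2)
        {
            // 32-bit wchar_t: the code point is stored directly.
            out.append(1, static_cast<wchar_t>(codepoint));
        }
        else
        {
            // 16-bit wchar_t: code points above 0xFFFF become a surrogate pair.
            if (codepoint > 0xffff)
            {
                codepoint -= 0x10000;
                out.append(1, static_cast<wchar_t>(0xd800 + (codepoint >> 10)));
                out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
            }
            else if (codepoint < 0xd800 || codepoint >= 0xe000)
                out.append(1, static_cast<wchar_t>(codepoint));
        }
    }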

answered Oct 10 '22 by Mark Ransom