What is the optimal multiplatform way of dealing with Unicode strings under C++?

I know that there are already several questions on StackOverflow about std::string versus std::wstring or similar but none of them proposed a full solution.

In order to obtain a good answer I should define the requirements:

  • multiplatform usage, must work on Windows, OS X and Linux
  • minimal effort for conversion to/from platform-specific Unicode strings such as CFStringRef, wchar_t *, char * as UTF-8, or whatever other types the OS APIs require. Remark: I don't need code-page conversion support, because I expect to use only Unicode-compatible functions on all supported operating systems.
  • if an external library is required, it should be open source and under a very liberal license such as BSD, not LGPL.
  • be able to use a printf format syntax or similar.
  • easy way of string allocation/deallocation
  • performance is not very important because I assume that the Unicode strings are used only for application UI.
  • some example code would be appreciated

I would really appreciate only one proposed solution per answer, so that people can vote for their preferred alternative. If you have more than one alternative, just add another answer.

Please suggest something that actually worked for you.

Related questions:

  • stdwstring-vs-stdstring
  • does-c0x-support-stdwstring-conversion-to-from-utf-8-byte-sequence
  • portable-wchart-in-c
sorin asked Jan 10 '10 17:01


2 Answers

I would strongly recommend using UTF-8 internally in your application, using regular old char* or std::string for data storage. For interfacing with APIs that use a different encoding (ASCII, UTF-16, etc.), I'd recommend using libiconv, which is licensed under the LGPL.

Example usage:

#include <cassert>
#include <cstring>
#include <iconv.h>

class TempWstring
{
public:
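  TempWstring(const char *str)
  {
    assert(sUTF8toUTF16 != (iconv_t)-1);
    size_t inBytesLeft = strlen(str);
    size_t outBytesLeft = 2 * (inBytesLeft + 1);  // worst case, plus 2 bytes for the terminator
    mStr = new char[outBytesLeft];
    // POSIX declares the input argument as char **. Some older libiconv builds
    // declare it as const char **; there, pass &str directly instead.
    char *inBuf = const_cast<char *>(str);
    char *outBuf = mStr;
    // iconv() returns (size_t)-1 on error, otherwise the number of irreversible conversions
    size_t result = iconv(sUTF8toUTF16, &inBuf, &inBytesLeft, &outBuf, &outBytesLeft);
    assert(result != (size_t)-1 && inBytesLeft == 0);
    assert(outBytesLeft >= 2);
    outBuf[0] = outBuf[1] = '\0';  // the input NUL was not converted, so terminate by hand
  }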

  ~TempWstring()
  {
    delete [] mStr;
  }

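  // Only valid where wchar_t is 16 bits wide (e.g. Windows); on Linux and OS X it is usually 32 bits
  const wchar_t *Str() const { return (const wchar_t *)mStr; }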

  static void Init()
  {
    sUTF8toUTF16 = iconv_open("UTF-16LE", "UTF-8");
    assert(sUTF8toUTF16 != (iconv_t)-1);
  }

  static void Shutdown()
  {
    int err = iconv_close(sUTF8toUTF16);
    assert(err == 0);
  }

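private:
  // Non-copyable: a copy would delete[] the same buffer twice
  TempWstring(const TempWstring &);
  TempWstring &operator=(const TempWstring &);

  char *mStr;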

  static iconv_t sUTF8toUTF16;
};

iconv_t TempWstring::sUTF8toUTF16 = (iconv_t)-1;

// At program startup:
TempWstring::Init();

// At program termination:
TempWstring::Shutdown();

// Now, to convert a UTF-8 string to a UTF-16 string, just do this:
TempWstring x("Entr\xc3\xa9""e");  // "Entrée"
const wchar_t *ws = x.Str();  // valid until x goes out of scope

// A less contrived example:
HWND hwnd = CreateWindowW(L"class name",
                          TempWstring("UTF-8 window title").Str(),
                          dwStyle, x, y, width, height, parent, menu, hInstance, lpParam);
Adam Rosenfield answered Sep 28 '22 04:09


Same as Adam Rosenfield's answer (+1), but I use UTFCPP (the UTF8-CPP library) instead of libiconv.
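
For readers who want to see what the UTF-8 to UTF-16 step involves without installing either library, here is a minimal dependency-free sketch. It is not UTF8-CPP's API (the library ships its own conversion functions); `Utf8ToUtf16` is a hypothetical name chosen for illustration, and the code assumes a C++11 compiler for `char16_t` and `std::u16string`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of UTF8-CPP or libiconv.
// Decodes UTF-8 and re-encodes it as UTF-16 code units.
// Throws std::invalid_argument on truncated or malformed input.
std::u16string Utf8ToUtf16(const std::string &in)
{
  std::u16string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    char32_t cp;
    std::size_t len;
    // The lead byte determines the sequence length and the first payload bits
    if (lead < 0x80)                { cp = lead;        len = 1; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
    else throw std::invalid_argument("invalid UTF-8 lead byte");

    if (i + len > in.size())
      throw std::invalid_argument("truncated UTF-8 sequence");

    // Each continuation byte must look like 10xxxxxx and adds 6 payload bits
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(in[i + k]);
      if ((c & 0xC0) != 0x80)
        throw std::invalid_argument("invalid UTF-8 continuation byte");
      cp = (cp << 6) | (c & 0x3F);
    }
    i += len;

    // Reject overlong encodings, surrogate code points and values past U+10FFFF
    static const char32_t minForLen[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < minForLen[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      throw std::invalid_argument("invalid code point");

    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));  // one code unit
    } else {
      cp -= 0x10000;                             // surrogate pair
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
  }
  return out;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting code units can be copied into a std::wstring before calling a ...W function. A maintained library is still the better choice for real applications; the sketch only shows how little state the conversion needs.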

Klaim answered Sep 28 '22 03:09