
How to convert Unicode string into a utf-8 or utf-16 string?

My VS2005 project uses the Unicode character set, while the SQLite C/C++ API provides

int sqlite3_open(
  const char *filename,   /* Database filename (UTF-8) */
  sqlite3 **ppDb          /* OUT: SQLite db handle */
);
int sqlite3_open16(
  const void *filename,   /* Database filename (UTF-16) */
  sqlite3 **ppDb          /* OUT: SQLite db handle */
);

for opening a database. How can I convert a string, CString, or wstring to UTF-8 or UTF-16?

Thanks very much!

user25749 asked Nov 11 '08 08:11


3 Answers

Short answer:

No conversion is required if you use Unicode strings such as CString or wstring. Use sqlite3_open16() and make sure you pass a WCHAR pointer, cast to void *, to the API. For a CString, for example: (const void*)(LPCWSTR)strFilename. (The void * seems lame! Even if this library is cross-platform, I guess they could have defined a platform-dependent wide character type that is friendlier than a void *.)

The longer answer:

You don't have a Unicode string that you want to convert to UTF8 or UTF16. You have a Unicode string represented in your program using a given encoding: Unicode is not a binary representation per se. Encodings say how the Unicode code points (numerical values) are represented in memory (binary layout of the number). UTF8 and UTF16 are the most widely used encodings. They are very different though.

When a VS project says "Unicode charset", it actually means "characters are encoded as UTF-16". Therefore, you can use sqlite3_open16() directly; no conversion is required. Characters are stored in the WCHAR type (as opposed to char), which takes 16 bits. WCHAR falls back on the standard C type wchar_t, which takes 16 bits on Win32 but might be different on other platforms. (Thanks for the correction, Checkers.)

There's one more detail that you might want to pay attention to: UTF-16 exists in two flavors, big-endian and little-endian, which differ in the byte order of each 16-bit code unit. The function prototype you give for UTF-16 doesn't say which ordering is used, but the SQLite documentation describes the sqlite3_open16() filename as UTF-16 in the native byte order, so it matches what Windows uses (little-endian on x86 and x64).
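To make the byte-order point concrete, here is a minimal, portable sketch. The helper names is_little_endian and to_utf16le_bytes are made up for this illustration; they are not part of SQLite or the Windows API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns true when the first byte in memory of a
// 16-bit value is its low-order byte, i.e. the machine is little-endian.
bool is_little_endian() {
    const std::uint16_t unit = 0x0041;  // UTF-16 code unit for 'A'
    unsigned char bytes[sizeof unit];
    std::memcpy(bytes, &unit, sizeof unit);
    return bytes[0] == 0x41;
}

// Hypothetical helper: writes one UTF-16 code unit as two bytes in
// little-endian order (low byte first), whatever the host byte order is.
void to_utf16le_bytes(std::uint16_t unit, unsigned char out[2]) {
    out[0] = static_cast<unsigned char>(unit & 0xFF);
    out[1] = static_cast<unsigned char>(unit >> 8);
}
```

On a little-endian machine, the in-memory layout of a wchar_t buffer on Windows already has this low-byte-first order, which is why nothing needs to be swapped before handing the pointer to an API that expects native byte order.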

EDIT: Answer to comment by Checkers:

UTF-16 uses 16-bit code units. Under Win32, wchar_t is used as that storage unit (on most other platforms wchar_t is 32 bits wide). The trick is that some Unicode characters require a sequence of two such 16-bit code units, called a surrogate pair.

In the same way, UTF-8 represents one character using a sequence of 1 to 4 bytes, yet UTF-8 strings are stored using the char type.
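As a rough illustration of surrogate pairs and of the 1-to-4-byte rule, the sketch below encodes a single code point as UTF-16 code units and reports how many bytes UTF-8 would need for it. The function names encode_utf16 and utf8_length are assumptions for this example, not library functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: encodes one Unicode code point as UTF-16 code units.
// Precondition: cp is a valid scalar value (not a surrogate, <= U+10FFFF).
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
    if (cp < 0x10000) {
        // Fits in a single 16-bit code unit.
        return { static_cast<std::uint16_t>(cp) };
    }
    const std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
    return {
        static_cast<std::uint16_t>(0xD800 + (v >> 10)),    // high surrogate
        static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))   // low surrogate
    };
}

// Hypothetical helper: number of bytes UTF-8 needs for the same code point.
std::size_t utf8_length(std::uint32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}
```

For instance, U+20AC (the euro sign) is one UTF-16 code unit and three UTF-8 bytes, while U+1F600 needs the surrogate pair 0xD83D 0xDE00 in UTF-16 and four bytes in UTF-8.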

Serge Wautier answered Sep 30 '22 13:09


Use the WideCharToMultiByte function. Specify CP_UTF8 for the CodePage parameter.

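CHAR buf[256]; // or whatever size you need
int written = WideCharToMultiByte(
  CP_UTF8,         // convert to UTF-8
  0,               // no conversion flags
  StringToConvert, // the UTF-16 string you have (LPCWSTR)
  -1,              // input length; -1 means the string is null-terminated
  buf,             // output buffer
  _countof(buf),   // size of the output buffer in bytes; pass 0 to get the required size as the return value
  NULL,            // must be NULL for CP_UTF8
  NULL             // must be NULL for CP_UTF8
);
// written == 0 means the conversion failed; call GetLastError() for details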

Also, the default encoding for Unicode apps on Windows is UTF-16LE, so you might not need to perform any translation and can just use the second version, sqlite3_open16.
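For readers who want to see what a UTF-16 to UTF-8 conversion involves, or who need it off Windows, here is a portable sketch. It is not how WideCharToMultiByte is implemented; the names append_utf8 and utf16_to_utf8 are invented for this example, and replacing unpaired surrogates with U+FFFD is a choice made by this sketch rather than documented Windows behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one code point to `out`.
static void append_utf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts a UTF-16 string to UTF-8.
// Assumption of this sketch: unpaired surrogates become U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high + low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate: substitute replacement char
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The result's c_str() is the kind of UTF-8 char pointer that sqlite3_open() expects, although on Windows passing the original UTF-16 string to sqlite3_open16() avoids the conversion entirely.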

1800 INFORMATION answered Sep 30 '22 15:09


All the C++ string types are charset-neutral. They just settle on a character width and make no further assumptions. A wstring uses 16-bit characters on Windows, corresponding roughly to UTF-16, but it still depends on what you store in the string. The wstring doesn't in any way enforce that the data you put in it is valid UTF-16. Windows uses UTF-16 when UNICODE is defined, though, so most likely your strings are already UTF-16 and you don't need to do anything.
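Since the string class itself checks nothing, a validity check has to be written separately if you want one. A minimal sketch follows; the function name is_well_formed_utf16 is made up, and it takes std::u16string rather than std::wstring so that the code unit is 16 bits on every platform.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if every surrogate code unit in `s`
// belongs to a proper high-then-low surrogate pair.
bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High surrogate: the next unit must be a low surrogate.
            if (i + 1 == s.size()) return false;
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;  // skip the low surrogate we just validated
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            // Low surrogate with no preceding high surrogate.
            return false;
        }
    }
    return true;
}
```

A string holding only a lone 0xD83D compiles and stores fine, yet this check rejects it, which is exactly the sense in which the container is encoding-agnostic.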

A few others have suggested using the WideCharToMultiByte function, which is one way to convert UTF-16 to UTF-8. But since SQLite can handle UTF-16, that shouldn't be necessary.

jalf answered Sep 30 '22 13:09