How do I convert a Unicode string into a UTF-8 or UTF-16 string? My VS2005 project uses the Unicode character set, while sqlite's C/C++ API provides
int sqlite3_open(
  const char *filename,   /* Database filename (UTF-8) */
  sqlite3 **ppDb          /* OUT: SQLite db handle */
);
int sqlite3_open16(
  const void *filename,   /* Database filename (UTF-16) */
  sqlite3 **ppDb          /* OUT: SQLite db handle */
);
for opening a database file. How can I convert a string, CString, or wstring into the UTF-8 or UTF-16 charset?
Thanks very much!
Short answer:
No conversion required if you use Unicode strings such as CString or wstring. Use sqlite3_open16().
You will have to make sure you pass a WCHAR pointer, cast to void *, to the API. For a CString that would be: (void*)(LPCWSTR)strFilename. (Seems lame! Even though this lib is cross-platform, I guess they could have defined a wide-char type that depends on the platform and is less unfriendly than a void *.)
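As a minimal sketch (OpenDatabase is a made-up helper name; it assumes sqlite3.h is on your include path), the whole "conversion" amounts to that one cast:

#include <string>
#include "sqlite3.h"

// On Win32 a std::wstring already holds UTF-16 code units, so its
// buffer can be handed to sqlite3_open16() directly; c_str() is
// null-terminated, and only the cast to void * is needed.
bool OpenDatabase(const std::wstring& path, sqlite3** db)
{
    return sqlite3_open16(static_cast<const void*>(path.c_str()), db) == SQLITE_OK;
}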
The longer answer:
You don't have a Unicode string that you want to convert to UTF8 or UTF16. You have a Unicode string represented in your program using a given encoding, because Unicode is not a binary representation per se. An encoding says how the Unicode code points (numerical values) are laid out in memory (the binary layout of the number). UTF8 and UTF16 are the most widely used encodings, and they are very different from each other.
When a VS project says "Unicode charset", it actually means "characters are encoded as UTF16". Therefore you can use sqlite3_open16() directly; no conversion is required. Characters are stored in the WCHAR type (as opposed to char), which takes 16 bits. (WCHAR falls back on the standard C type wchar_t, which takes 16 bits on Win32 but might be different on other platforms. Thanks for the correction, Checkers.)
There's one more detail that you might want to pay attention to: UTF16 exists in two flavors, Big Endian and Little Endian; that's the byte ordering of each 16-bit code unit. The function prototype you give for UTF16 doesn't say which ordering is used, but you're pretty safe assuming that sqlite uses the same endianness as Windows (Little Endian, IIRC. I know the order but have always had problems with the names :-) ).
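To see the byte ordering concretely, here is a tiny standalone check (illustrative only, not part of sqlite): on x86 Windows the UTF-16 code unit for 'A' (0x0041) is stored low byte first, which is exactly UTF-16LE.

#include <stdio.h>
#include <string.h>

int main()
{
    wchar_t ch = L'A';                 // one UTF-16 code unit: 0x0041
    unsigned char bytes[sizeof ch];
    memcpy(bytes, &ch, sizeof ch);
    printf("%02X %02X\n", bytes[0], bytes[1]); // prints "41 00": Little Endian
    return 0;
}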
EDIT: Answer to comment by Checkers:
UTF16 uses 16-bit code units. Under Win32 (and only on Win32), wchar_t is used for such a storage unit. The trick is that some Unicode characters require a sequence of two such 16-bit code units; they are called Surrogate Pairs.
In the same way, UTF8 represents one character using a sequence of 1 to 4 bytes, yet UTF8 strings are used with the char type.
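A concrete example (the values come straight from the Unicode tables): U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane, so it takes two 16-bit code units in UTF-16 but is still stored in a plain char array in UTF-8:

// U+1D11E: one character, two UTF-16 code units, four UTF-8 bytes.
const wchar_t utf16[] = { 0xD834, 0xDD1E, 0 }; // high surrogate, low surrogate
const char    utf8[]  = "\xF0\x9D\x84\x9E";    // 4-byte UTF-8 sequence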
Use the WideCharToMultiByte function, specifying CP_UTF8 for the CodePage parameter.
CHAR buf[256]; // or whatever size you need
WideCharToMultiByte(
  CP_UTF8,         // convert to UTF-8
  0,               // default flags
  StringToConvert, // the UTF-16 string you have
  -1,              // length of the string; -1 indicates it is null-terminated
  buf,             // output buffer
  _countof(buf),   // size of the buffer in bytes; if you pass zero, the return value is the length required for the output buffer
  NULL,            // must be NULL when converting to CP_UTF8
  NULL             // must be NULL when converting to CP_UTF8
);
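If the string may not fit in a fixed buffer, the usual pattern (a sketch; ToUtf8 is a hypothetical helper, not a Windows API) is to call the function twice: first with a zero-sized buffer to learn the required length, then again to do the actual conversion:

#include <windows.h>
#include <string>

// Convert a null-terminated UTF-16 string to a UTF-8 std::string.
std::string ToUtf8(const wchar_t* wide)
{
    // First call: cbMultiByte == 0 makes the function return the
    // required size in bytes, including the terminating null
    // (because cchWideChar is -1).
    int size = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
    if (size <= 0)
        return std::string(); // conversion failed
    std::string utf8(size, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide, -1, &utf8[0], size, NULL, NULL);
    utf8.resize(size - 1); // drop the embedded terminating null
    return utf8;
}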
Also, the default encoding for Unicode apps on Windows is UTF-16LE, so you might not need to perform any translation at all and can just use the second version, sqlite3_open16().
All the C++ string types are charset-neutral: they just settle on a character width and make no further assumptions. A wstring uses 16-bit characters on Windows, corresponding roughly to utf-16, but it still depends on what you store in the string; wstring doesn't in any way enforce that the data you put into it is valid utf16. Windows uses utf16 when UNICODE is defined, though, so most likely your strings are already utf16 and you don't need to do anything.
A few others have suggested using the WideCharToMultiByte function, which is (one of) the way(s) to go to convert utf16 to utf8. But since sqlite can handle utf16, that shouldn't be necessary.