I have a Java application that uses a C++ DLL via JNI. A few of the DLL's methods take string arguments, and some of them return objects that contain strings as well.
Currently the DLL does not support Unicode, so the string handling is rather easy:
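Roughly something like this (a sketch with made-up class and method names rather than my exact code): jstring arguments come in through GetStringUTFChars as a plain char*, and results go back out through NewStringUTF.

#include <jni.h>
#include <string>

// Sketch only - the class/method names are made up, not the actual DLL exports.
extern "C" JNIEXPORT jstring JNICALL
Java_example_NativeLib_echo(JNIEnv* env, jobject, jstring input)
{
    // GetStringUTFChars yields a char* (modified UTF-8, identical to ASCII
    // for 7-bit text) that the non-Unicode DLL code can use directly.
    const char* chars = env->GetStringUTFChars(input, NULL);
    std::string result = chars; // hand off to the existing non-Unicode code
    env->ReleaseStringUTFChars(input, chars);

    // Returning a string is just as simple: NewStringUTF from a char*.
    return env->NewStringUTF(result.c_str());
}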
I'm now in the process of modifying the DLL to support Unicode, switching to the TCHAR type (which, when UNICODE is defined, maps to Windows' WCHAR type). Modifying the DLL itself is going well, but I'm not sure how to modify the JNI portion of the code.
The only thing I can think of right now is this:
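That is, something along these lines (a sketch with a made-up helper name, not my real code): copy the WCHAR data into a byte array and let the String(byte[], String charsetName) constructor do the decoding.

#include <jni.h>
#include <windows.h>
#include <wchar.h>

// Sketch: build a jstring from a WCHAR buffer by going through
// new String(byte[], charsetName). The charset name is the open question.
static jstring NewStringViaCharset(JNIEnv* env, const WCHAR* wide)
{
    jsize byteCount = (jsize)(wcslen(wide) * sizeof(WCHAR));

    jbyteArray bytes = env->NewByteArray(byteCount);
    env->SetByteArrayRegion(bytes, 0, byteCount, (const jbyte*)wide);

    jclass stringClass = env->FindClass("java/lang/String");
    jmethodID ctor = env->GetMethodID(stringClass, "<init>", "([BLjava/lang/String;)V");

    jstring charsetName = env->NewStringUTF("UTF-16"); // or UTF-16LE / UTF-16BE?
    return (jstring)env->NewObject(stringClass, ctor, bytes, charsetName);
}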
The only problem with this method is that I'm not sure what charset name to use. WCHARs are 2 bytes long, so I'm pretty sure the encoding is UTF-16, but on the Java side there are three possibilities: UTF-16, UTF-16BE, and UTF-16LE. I haven't found any documentation that says what the byte order is, but I can probably figure it out with some quick testing.
Is there a better way? If possible I'd like to continue constructing the jstring objects within the DLL, as that way I won't have to modify any of the usages of those methods. However, the NewString JNI method doesn't take a charset identifier.
This answer suggests that the byte ordering of WCHARs is not guaranteed...
Since you are on Windows, you could try WideCharToMultiByte to convert the WCHARs to UTF-8 and then use your existing JNI code.

You will need to be careful using WideCharToMultiByte because of the possibility of buffer overruns in the lpMultiByteStr parameter. To get around this, call the function twice: first with lpMultiByteStr set to NULL and cbMultiByte set to zero - this returns the length of the required lpMultiByteStr buffer without attempting to write to it. Once you have the length, allocate a buffer of the required size and call the function again.
Example code:
int utf8_length;
wchar_t* utf16 = ...; // the UTF-16 (WCHAR) string coming from the DLL

// First call: determine the required output buffer size
utf8_length = WideCharToMultiByte(
    CP_UTF8,     // Convert to UTF-8
    0,           // No special character conversions required
                 // (UTF-16 and UTF-8 support the same characters)
    utf16,       // UTF-16 string to convert
    -1,          // utf16 is NULL terminated (if not, use its length)
    NULL,        // Determining correct output buffer size
    0,           // Determining correct output buffer size
    NULL,        // Must be NULL for CP_UTF8
    NULL);       // Must be NULL for CP_UTF8
if (utf8_length == 0) {
    // Error - call GetLastError for details
}

// Allocate space for the UTF-8 string; the length returned above includes
// the terminating NUL because -1 was passed for the source length
char* utf8 = (char*)malloc(utf8_length);

// Second call: perform the actual conversion
utf8_length = WideCharToMultiByte(
    CP_UTF8,     // Convert to UTF-8
    0,           // No special character conversions required
                 // (UTF-16 and UTF-8 support the same characters)
    utf16,       // UTF-16 string to convert
    -1,          // utf16 is NULL terminated (if not, use its length)
    utf8,        // UTF-8 output buffer
    utf8_length, // UTF-8 output buffer size
    NULL,        // Must be NULL for CP_UTF8
    NULL);       // Must be NULL for CP_UTF8
if (utf8_length == 0) {
    // Error - call GetLastError for details
}
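To tie this into the existing JNI code, the whole conversion can be wrapped in a small helper along these lines (a sketch; the helper name is made up):

#include <jni.h>
#include <windows.h>
#include <stdlib.h>

// Sketch of a complete WCHAR* -> jstring helper built on the two calls above.
// Note that NewStringUTF expects JNI's modified UTF-8; for text without embedded
// NULs or characters outside the BMP the standard UTF-8 produced by
// WideCharToMultiByte is identical.
static jstring WideCharsToJString(JNIEnv* env, const wchar_t* utf16)
{
    int utf8_length = WideCharToMultiByte(CP_UTF8, 0, utf16, -1, NULL, 0, NULL, NULL);
    if (utf8_length == 0) {
        return NULL; // call GetLastError for details
    }

    char* utf8 = (char*)malloc(utf8_length);
    if (WideCharToMultiByte(CP_UTF8, 0, utf16, -1, utf8, utf8_length, NULL, NULL) == 0) {
        free(utf8);
        return NULL; // call GetLastError for details
    }

    jstring result = env->NewStringUTF(utf8); // the existing JNI code path
    free(utf8);
    return result;
}

Because -1 is passed for the source length, the size returned by the first call already includes the terminating NUL, so the buffer can be handed straight to NewStringUTF.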
I found a little FAQ about the byte order mark. Also from that FAQ:
UTF-16 and UTF-32 use code units that are two and four bytes long respectively. For these UTFs, there are three sub-flavors: BE, LE and unmarked. The BE form uses big-endian byte serialization (most significant byte first), the LE form uses little-endian byte serialization (least significant byte first) and the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.
I'm assuming that on the Java side the UTF-16 charset will look for this BOM and deal with the byte order properly. We all know how dangerous assumptions can be...
Edit because of comment:
Microsoft uses UTF-16 little-endian. Java's UTF-16 charset tries to interpret a BOM; when the BOM is missing it defaults to big-endian (UTF-16BE). The UTF-16BE and UTF-16LE charsets do not interpret a BOM; they simply use the byte order they name.