The question title is basically what I'd like to ask:
[MarshalAs(UnmanagedType.LPStr)] - how does this convert UTF-8 strings to char*?
I use the above attribute when I communicate between C# and C++ DLLs; more specifically, between:
somefunction(char *string) [C++ DLL]
somefunction([MarshalAs(UnmanagedType.LPStr)] string text) [C#]
When I send my UTF-8 text (scintilla.Text) from C# into my C++ DLL, the VS 10 debugger shows that:
the C# string was successfully converted to char*
the resulting char* properly reflects the corresponding UTF-8 chars (including the bit in Korean) in the watch window.
Here's a screenshot (with more details):
As you can see, initialScriptText[0] returns the single byte (char) 'B', and the contents of char* initialScriptText are displayed properly (including Korean) in the VS watch window.
Going through the char pointer, it seems that English is saved as one byte per char, while Korean seems to be saved as two bytes per char (the Korean word in the screenshot is 3 letters, hence saved in 6 bytes).
This seems to show that each 'letter' isn't saved in equal size containers, but differs depending on language. (possible hint on type?)
I'm trying to achieve the same result in pure C++: reading in UTF-8 files and saving the result as char*.
Here's an example of my attempt to read a UTF-8 file and convert to char* in C++:
observations:
[...] wchar_t* to char*
[...] wchar_t* successfully to char*
[...] char*. (the screenshot also shows my terrible failure in using wcstombs)
note: I'm using the utf8 header from (http://utfcpp.sourceforge.net/)
Please correct me on any mistakes in my code/observations.
I'd like to be able to mimic the result I'm getting through the c# marshal and I've realised after going through all this that I'm completely stuck. Any ideas?
[MarshalAs(UnmanagedType.LPStr)] - how does this convert utf-8 strings to char* ?
It doesn't. There is no such thing as a "utf-8 string" in managed code; strings are always encoded in UTF-16. Marshaling to and from an LPStr is done with the default system code page, which makes it fairly remarkable that you see Korean glyphs in the debugger, unless your system code page is 949.
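The byte counts in the question can be checked against how UTF-8 lays out its sequences. The sketch below is only an illustration (the helper names utf8_sequence_length and utf8_code_points are made up for this example, and it assumes well-formed input). It shows that a Hangul syllable such as U+AC00 takes three bytes in UTF-8, so two bytes per Korean character is what a double-byte code page like 949 would produce, not UTF-8.

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with lead byte `b`.
// Returns 0 if `b` is a continuation byte or not a valid lead byte.
// (Hypothetical helper, for illustration only.)
std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx: includes Hangul syllables
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Counts code points in well-formed UTF-8 by counting every byte
// that is not a continuation byte (10xxxxxx).
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For example, the four bytes "B\xEA\xB0\x80" (the letter B followed by U+AC00) hold two code points: one byte for B and three for the Hangul syllable.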
If interop with UTF-8 is a hard requirement, then you need to use a byte[] in the pinvoke declaration and convert back and forth yourself with System.Text.Encoding.UTF8: its GetBytes() method converts a string to a byte[], and its GetString() method converts a byte[] back to a string. Avoid all this if possible by using wchar_t[] in the native code.
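For the pure C++ half of the question, a UTF-8 file already is a sequence of char bytes, so one possible approach is to read it in binary mode without any wchar_t or wcstombs round trip. The following is a minimal sketch under that assumption; read_utf8_file is a hypothetical name, the file is assumed to really be UTF-8, and the only processing done is dropping an optional byte order mark. It does not validate the encoding.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Reads a whole file as raw bytes and returns them unchanged, apart from
// stripping a leading UTF-8 byte order mark (EF BB BF) if one is present.
// (Hypothetical helper; assumes the file is UTF-8 and does not validate it.)
std::string read_utf8_file(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    if (bytes.size() >= 3 && bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        bytes.erase(0, 3);  // drop the BOM so the text starts at the first character
    }
    return bytes;
}
```

The returned std::string can be handed to code expecting a const char* through c_str(). Whether the debugger then displays the Korean text correctly depends on how it interprets the bytes, not on the bytes themselves.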