In the documentation of the JNI function FindClass I can read this about the name argument:
name: a fully-qualified class name (...) The string is encoded in modified UTF-8.
As I read the documentation, a modified UTF-8 string has to end with two '\0' chars:
the null character (char)0 is encoded using the two-byte format rather than the one-byte format
Does this mean that I should invoke FindClass from C like this:
FindClass("java/lang/String\0")
i.e. with double '\0' at the end?
Character set, encoding and termination are three different things. Obviously, an encoding is designed for a specific character set but a character set can be encoded in multiple ways. And, often, a terminator (if used) is an encoded character, but with modified UTF-8, this is not the case.
Java uses the Unicode character set. For string and char types, it uses the UTF-16 encoding. The string type is counted; it doesn't use a terminator.
In C, terminated strings are common, as are single-byte encodings of various character sets. C and C++ compilers terminate string literals with the NUL character. In the compiler's destination character set encoding, this is either one or two 0x00 bytes. Almost all common character sets and their encodings have the same byte representation for the non-control ASCII characters. This is true of the UTF-8 encoding of the Unicode character set. (Note, though, that it is not true for characters outside that limited subset.)
The JNI designers opted to use this limited "interoperability" with C strings. Many JNI functions accept 0x00-terminated modified UTF-8 strings. These are compatible with what a C compiler would produce from a string literal in the source code, again provided that the characters are limited to non-control ASCII characters. This covers the use case of writing Java package and class, method and field names in JNI. (Well, almost: Java allows any Unicode currency symbol in an identifier.)
So, you can pass C string literals to JNI functions in a WYSIWYG style. There is no need to add a terminator; the compiler does that. An extra '\0' would simply be encoded as another 0x00 byte, so it does no harm, but it isn't necessary.
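You can check what the compiler stores for the two literals without involving a JVM at all. This is only a minimal sketch in plain C; the macro and function names are made up for illustration and are not part of JNI.

```c
#include <stddef.h>
#include <string.h>

/* The two literals from the question. sizeof on a string literal counts
   every byte the compiler stores, including the terminator it appends. */
#define PLAIN_NAME  "java/lang/String"
#define PADDED_NAME "java/lang/String\0"

/* Bytes actually stored for each literal (hypothetical helper names). */
size_t plain_stored(void)  { return sizeof(PLAIN_NAME); }
size_t padded_stored(void) { return sizeof(PADDED_NAME); }

/* Length seen by any code that stops at the first 0x00 byte. */
size_t plain_visible(void)  { return strlen(PLAIN_NAME); }
size_t padded_visible(void) { return strlen(PADDED_NAME); }
```

The padded literal occupies one more byte of storage, but a reader that stops at the first 0x00 sees the same 16 characters in both cases, so the extra '\0' changes nothing for the callee.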
There are a couple of modifications from the standard UTF-8 encoding. One exists so that C functions expecting a 0x00 terminator can "handle" modified UTF-8 strings: the NUL character (U+0000) is encoded as the two bytes 0xC0 0x80 rather than as the single byte 0x00 that standard UTF-8 would use. That allows a modified UTF-8 string to be laid into a buffer with a 0x00 terminator placed beyond the bytes of the encoded string. The other modification is a bit esoteric (supplementary characters are encoded as a surrogate pair, three bytes per surrogate, instead of in the four-byte form), but both modifications make a modified UTF-8 string incompatible with a strictly compliant UTF-8 function.
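To make the two differences concrete, here is a rough sketch of an encoder from UTF-16 code units to modified UTF-8, written from my understanding of the format rather than taken from any JVM source. The function name to_modified_utf8 is hypothetical. It treats every code unit independently, so a surrogate pair comes out as two three-byte sequences, and U+0000 comes out as 0xC0 0x80.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode n UTF-16 code units as modified UTF-8 into out, which has room for
   cap bytes including the 0x00 terminator. Returns the number of bytes
   written, not counting the terminator, or (size_t)-1 if out is too small.
   Hypothetical helper, not a JNI function. */
size_t to_modified_utf8(const uint16_t *in, size_t n,
                        unsigned char *out, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t u = in[i];
        /* U+0000 is deliberately pushed into the two-byte form. */
        size_t need = (u != 0 && u < 0x80) ? 1 : (u < 0x800) ? 2 : 3;
        if (o + need + 1 > cap)
            return (size_t)-1;
        if (need == 1) {
            out[o++] = (unsigned char)u;
        } else if (need == 2) {
            out[o++] = (unsigned char)(0xC0 | (u >> 6));
            out[o++] = (unsigned char)(0x80 | (u & 0x3F));
        } else {
            /* Surrogates land here too: each half gets its own 3 bytes. */
            out[o++] = (unsigned char)(0xE0 | (u >> 12));
            out[o++] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (u & 0x3F));
        }
    }
    if (o + 1 > cap)
        return (size_t)-1;
    out[o] = 0x00; /* the terminator sits beyond the encoded string */
    return o;
}
```

The only 0x00 byte in the output is the terminator, which is the whole point of the first modification.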
You didn't ask, but there is another use of 0x00-terminated, modified UTF-8 strings in JNI: the GetStringUTFChars and NewStringUTF functions. (The JNI documentation doesn't actually say that GetStringUTFChars returns a 0x00-terminated string, but there are no known JVM implementations that don't. Check your JVM implementor's documentation or source code.) These functions are designed on the same "interoperability" basis. However, the use cases are different, which makes them dangerous. They are generally used to pass Java strings to and from C functions. Those C functions generally have no idea what modified UTF-8 is, or possibly even what UTF-8 or Unicode are. It is much more direct to use the Java String and Charset classes to convert to and from the character sets and encodings that the C functions are designed for. Often it is a system, user, application or thread setting that determines which encoding a C function is using. The Java String class attempts to conform to such settings when not given a specific encoding for a conversion. But in many cases the desired encoding is fixed and can be specified with clear intent.
No, you don't encode the terminating zero; it is not part of the class name.
No, according to the first reference I found, it means it should be encoded like this:
FindClass("java/lang/String\xc0\x80");

The trailing \xc0\x80 is not the shortest way to encode the code point U+0000, which is why it's "modified" UTF-8.
Note that this assumes that you're really looking for class names whose names end in U+0000, which is rather unlikely. The C string should be terminated just like normal, with a single 0-byte as you get from just:
FindClass("java/lang/String");
The special 2-byte encoding of U+0000 provided by Modified UTF-8 only matters if you want to put U+0000 in a string, and still be able to differentiate it from the C terminator.
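The difference is easy to see from the C side. In this small sketch (the function names are invented for the example), the same text is written once with a raw 0x00 in the middle and once with the two-byte form, and strlen stands in for any C code that stops at the first terminator.

```c
#include <stddef.h>
#include <string.h>

/* A raw 0x00 inside the literal ends the string for any C-style reader. */
size_t raw_nul_visible(void)
{
    return strlen("ab\0cd");
}

/* The two-byte form C0 80 is not a terminator, so the reader keeps going.
   The literal is split so the hex escape cannot swallow the following "cd". */
size_t encoded_nul_visible(void)
{
    return strlen("ab\xc0\x80" "cd");
}
```

With the raw byte, everything after "ab" is invisible to the reader; with the encoded form, all six bytes are seen and the U+0000 stays part of the string.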