Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Getting true UTF-8 characters in Java JNI

Tags:

Is there an easy way to convert a Java string to a true UTF-8 byte array in JNI code?

Unfortunately GetStringUTFChars() almost does what's required but not quite, it returns a "modified" UTF-8 byte sequence. The main difference is that a modified UTF-8 doesn't contain any null characters (so you can treat is an ANSI C null terminated string) but another difference seems to be how Unicode supplementary characters such as emoji are treated.

A character such as U+1F604 "SMILING FACE WITH OPEN MOUTH AND SMILING EYES" is stored as a surrogate pair (two UTF-16 characters U+D83D U+DE04) and has a 4-byte UTF-8 equivalent of F0 9F 98 84, and that is the byte sequence that I get if I convert the string to UTF-8 in Java:

    char[] c = Character.toChars(0x1F604);     String s = new String(c);     System.out.println(s);     for (int i=0; i<c.length; ++i)         System.out.println("c["+i+"] = 0x"+Integer.toHexString(c[i]));     byte[] b = s.getBytes("UTF-8");     for (int i=0; i<b.length; ++i)         System.out.println("b["+i+"] = 0x"+Integer.toHexString(b[i] & 0xFF)); 

The code above prints the following:

😄 c[0] = 0xd83d c[1] = 0xde04 b[0] = 0xf0 b[1] = 0x9f b[2] = 0x98 b[3] = 0x84

However, if I pass 's' into a native JNI method and call GetStringUTFChars() I get 6 bytes. Each of the surrogate pair characters is being converted to a 3-byte sequence independently:

JNIEXPORT void JNICALL Java_EmojiTest_nativeTest(JNIEnv *env, jclass cls, jstring _s) {     const char* sBytes = env->GetStringUTFChars(_s, NULL);     for (int i=0; sBytes[i]!=0; ++i)         fprintf(stderr, "%d: %02x\n", i, sBytes[i]);     env->ReleaseStringUTFChars(_s, sBytes);     return result; } 

0: ed 1: a0 2: bd 3: ed 4: b8 5: 84

The Wikipedia UTF-8 article suggests that GetStringUTFChars() actually returns CESU-8 rather than UTF-8. That in turn causes my native Mac code to crash because it's not a valid UTF-8 sequence:

CFStringRef str = CFStringCreateWithCString(NULL, path, kCFStringEncodingUTF8); CFURLRef url = CFURLCreateWithFileSystemPath(NULL, str, kCFURLPOSIXPathStyle, false); 

I suppose I could change all my JNI methods to take a byte[] rather than a String and do the UTF-8 conversion in Java but that seems a bit ugly, is there a better solution?

like image 543
Rolf Avatar asked Aug 25 '15 13:08

Rolf


People also ask

How do I convert to UTF-8 in Java?

In Java, the OutputStreamWriter accepts a charset to encode the character streams into byte streams. We can pass a StandardCharsets. UTF_8 into the OutputStreamWriter constructor to write data to a UTF-8 file.

Is Java UTF-8 or 16?

The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units (that is, sequences of chars) and sequences of bytes.

Can UTF-8 support all characters?

UTF-8 supports any unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phonecian, Cherokee etc), as well as many non-spoken languages (Music notation, mathematical symbols, APL). The stated objective of the Unicode consortium is to encompass all communications.

Does Java uses UTF-8 as its internal character representation?

Java uses UTF-16 for the internal text representation and supports a non-standard modification of UTF-8 for string serialization.


1 Answers

This is clearly explained in the Java documentation:

JNI Functions

GetStringUTFChars

const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy); 

Returns a pointer to an array of bytes representing the string in modified UTF-8 encoding. This array is valid until it is released by ReleaseStringUTFChars().

Modified UTF-8

The JNI uses modified UTF-8 strings to represent various string types. Modified UTF-8 strings are the same as those used by the Java VM. Modified UTF-8 strings are encoded so that character sequences that contain only non-null ASCII characters can be represented using only one byte per character, but all Unicode characters can be represented.

All characters in the range \u0001 to \u007F are represented by a single byte, as follows:

table1

The seven bits of data in the byte give the value of the character represented.

The null character ('\u0000') and characters in the range '\u0080' to '\u07FF' are represented by a pair of bytes x and y:

table2

The bytes represent the character with the value ((x & 0x1f) << 6) + (y & 0x3f).

Characters in the range '\u0800' to '\uFFFF' are represented by 3 bytes x, y, and z:

table3

The character with the value ((x & 0xf) << 12) + ((y & 0x3f) << 6) + (z & 0x3f) is represented by the bytes.

Characters with code points above U+FFFF (so-called supplementary characters) are represented by separately encoding the two surrogate code units of their UTF-16 representation. Each of the surrogate code units is represented by three bytes. This means, supplementary characters are represented by six bytes, u, v, w, x, y, and z:

table4

The character with the value 0x10000+((v&0x0f)<<16)+((w&0x3f)<<10)+(y&0x0f)<<6)+(z&0x3f) is represented by the six bytes.

The bytes of multibyte characters are stored in the class file in big-endian (high byte first) order.

There are two differences between this format and the standard UTF-8 format. First, the null character (char)0 is encoded using the two-byte format rather than the one-byte format. This means that modified UTF-8 strings never have embedded nulls. Second, only the one-byte, two-byte, and three-byte formats of standard UTF-8 are used. The Java VM does not recognize the four-byte format of standard UTF-8; it uses its own two-times-three-byte format instead.

For more information regarding the standard UTF-8 format, see section 3.9 Unicode Encoding Forms of The Unicode Standard, Version 4.0.

Since U+1F604 is a supplementary character, and Java does not support UTF-8's 4-byte encoding format, U+1F604 is represented in modified UTF-8 by encoding the UTF-16 surrogate pair U+D83D U+DE04 using 3 bytes per surrogate, thus 6 bytes total.

So, to answer your question...

Is there an easy way to convert a Java string to a true UTF-8 byte array in JNI code?

You can either:

  1. Use GetStringChars() to get the original UTF-16 encoded characters, and then create your own UTF-8 byte array from that. The conversion from UTF-16 to UTF-8 is a very simply algorithm to implement by hand, or you can use any pre-existing implementation provided by your platform or 3rd party libraries.

  2. Have your JNI code call back into Java to invoke the String.getBytes(String charsetName) method to encode the jstring object to a UTF-8 byte array, eg:

    JNIEXPORT void JNICALL Java_EmojiTest_nativeTest(JNIEnv *env, jclass cls, jstring _s) {     const jclass stringClass = env->GetObjectClass(_s);     const jmethodID getBytes = env->GetMethodID(stringClass, "getBytes", "(Ljava/lang/String;)[B");      const jstring charsetName = env->NewStringUTF("UTF-8");     const jbyteArray stringJbytes = (jbyteArray) env->CallObjectMethod(_s, getBytes, charsetName);     env->DeleteLocalRef(charsetName);      const jsize length = env->GetArrayLength(stringJbytes);     const jbyte* pBytes = env->GetByteArrayElements(stringJbytes, NULL);       for (int i = 0; i < length; ++i)         fprintf(stderr, "%d: %02x\n", i, pBytes[i]);      env->ReleaseByteArrayElements(stringJbytes, pBytes, JNI_ABORT);      env->DeleteLocalRef(stringJbytes); } 

The Wikipedia UTF-8 article suggests that GetStringUTFChars() actually returns CESU-8 rather than UTF-8

Java's Modified UTF-8 is not exactly the same as CESU-8:

CESU-8 is similar to Java's Modified UTF-8 but does not have the special encoding of the NUL character (U+0000).

like image 175
Remy Lebeau Avatar answered Sep 28 '22 06:09

Remy Lebeau