 

Is a Java char array always a valid UTF-16 (Big Endian) encoding?

Say that I would encode a Java character array (char[]) instance as bytes:

  • using two bytes for each character
  • using big endian encoding (storing the most significant 8 bits in the leftmost and the least significant 8 bits in the rightmost byte)

Would this always create a valid UTF-16BE encoding? If not, which code points will result in an invalid encoding?


This question is very much related to this question about the Java char type and this question about the internal representation of Java strings.
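
For concreteness, here is a minimal sketch of the manual byte layout described above (the helper name toBigEndianBytes is purely illustrative, not from the original post):

static byte[] toBigEndianBytes(char[] chars) {
    byte[] out = new byte[chars.length * 2];
    for (int i = 0; i < chars.length; i++) {
        out[2 * i]     = (byte) (chars[i] >>> 8); // most significant 8 bits first
        out[2 * i + 1] = (byte) chars[i];         // least significant 8 bits second
    }
    return out;
}

Note that this simply copies the 16-bit values and performs no validation of surrogate pairing, which is exactly what the question is asking about.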

asked Jul 24 '15 by Maarten Bodewes

People also ask

Is UTF-16 Little endian?

Two notes: 1) "Unicode" in the Windows world specifically means UTF-16 Little Endian; Windows also has a "BigEndianUnicode" encoding that is UTF-16 BE. 2) UTF-16 is nevertheless variable-length, because supplementary characters are composed of surrogate pairs of two UTF-16 code units.

Does Java use UTF-16?

A Java String (before Java 9) is represented internally in the Java VM as a char array, encoded as UTF-16. Each char is one 16-bit UTF-16 code unit (2 bytes); characters outside the Basic Multilingual Plane take two such units. Thus, the characters of a Java String are represented using a char array.
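
A small illustration of that char-array view (the surrogate pair below encodes U+1D11E, MUSICAL SYMBOL G CLEF):

String s = "A\uD834\uDD1E";                          // "A" plus U+1D11E as a surrogate pair
char[] units = s.toCharArray();
System.out.println(units.length);                    // 3 code units, not 2 characters
System.out.println(s.codePointCount(0, s.length())); // 2 code points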

What is UTF-16 in Java?

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units.
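
For example, you can observe the one-or-two code unit split directly with Character.toChars:

char[] bmp  = Character.toChars(0x0041);   // 'A' -> one 16-bit code unit
char[] supp = Character.toChars(0x1F600);  // emoji -> surrogate pair, two code units
System.out.println(bmp.length + " " + supp.length);             // 1 2
System.out.printf("%04X %04X%n", (int) supp[0], (int) supp[1]); // D83D DE00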

What are UTF-16 characters?

UTF-16 is an encoding of Unicode in which each character is composed of either one or two 16-bit elements. Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts.


1 Answer

No. You can create char instances that contain any 16-bit value you desire; nothing constrains them to be valid UTF-16 code units, nor constrains an array of them to be a valid UTF-16 sequence. Even String does not require that its data be valid UTF-16:

char data[] = {'\uD800', 'b', 'c'};  // Unpaired lead surrogate
String str = new String(data);

The requirements for valid UTF-16 data are set out in Chapter 3 of the Unicode Standard (basically, everything must be a Unicode scalar value, and all surrogates must be correctly paired). You can test if a char array is a valid UTF-16 sequence, and turn it into a sequence of UTF-16BE (or LE) bytes, by using a CharsetEncoder:

CharsetEncoder encoder = Charset.forName("UTF-16BE").newEncoder();
ByteBuffer bytes = encoder.encode(CharBuffer.wrap(data)); // throws MalformedInputException if the data is not valid UTF-16

(And similarly using a CharsetDecoder if you have bytes.)
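
Putting the pieces together, here is a self-contained sketch (the class name Utf16Check and the sample arrays are just illustrative; a freshly created CharsetEncoder reports malformed input by default rather than replacing it):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Utf16Check {
    public static void main(String[] args) {
        char[] valid   = {'a', '\uD834', '\uDD1E'};  // correctly paired surrogates
        char[] invalid = {'\uD800', 'b', 'c'};       // unpaired lead surrogate

        for (char[] chars : new char[][] {valid, invalid}) {
            CharsetEncoder encoder = StandardCharsets.UTF_16BE.newEncoder();
            try {
                ByteBuffer bytes = encoder.encode(CharBuffer.wrap(chars));
                System.out.println("valid UTF-16BE, " + bytes.remaining() + " bytes");
            } catch (CharacterCodingException e) {
                System.out.println("not valid UTF-16: " + e);
            }
        }
    }
}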

answered Sep 17 '22 by 一二三