I need to encode and decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a byte order mark (BOM), and I need to produce encoded byte arrays with a BOM.
Also, because I'm dealing with a Microsoft client/server, I'd like to emit the encoding in little endian (along with the LE BOM) to avoid any misunderstandings. I do realize that with the BOM it should work big endian, but I don't want to swim upstream in the Windows world.
As an example, here is a method that encodes a java.lang.String as UTF-16 in little endian with a BOM:
import java.io.UnsupportedEncodingException;

public static byte[] encodeString(String message) {
    byte[] tmp = null;
    try {
        tmp = message.getBytes("UTF-16LE");
    } catch (UnsupportedEncodingException e) {
        // should not be possible: UTF-16LE is a required charset
        AssertionError ae = new AssertionError("Could not encode UTF-16LE");
        ae.initCause(e);
        throw ae;
    }

    // use brute force method to add BOM (FF FE = little endian)
    byte[] utf16lemessage = new byte[2 + tmp.length];
    utf16lemessage[0] = (byte) 0xFF;
    utf16lemessage[1] = (byte) 0xFE;
    System.arraycopy(tmp, 0, utf16lemessage, 2, tmp.length);
    return utf16lemessage;
}
What is the best way to do this in Java? Ideally I'd like to avoid copying the entire byte array into a new byte array that has two extra bytes allocated at the beginning.
The same goes for decoding such a string, but that's much more straightforward using the java.lang.String constructor:
public String(byte[] bytes, int offset, int length, String charsetName)
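For illustration, here is a minimal sketch of the decoding side. The class and method names are made up for this example. It assumes Java 7 or later (for StandardCharsets) and uses the Charset overloads of the String constructor, which avoid the checked UnsupportedEncodingException. The first method lets the "UTF-16" charset read the BOM; the second skips a little-endian BOM by offset, in the spirit of the constructor above.

```java
import java.nio.charset.StandardCharsets;

final class Utf16Decode {
    // The "UTF-16" decoder reads a leading BOM to pick the byte order
    // and does not include it in the resulting String.
    static String decodeWithBom(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    // Explicit variant: skip a little-endian BOM (FF FE) via the offset
    // argument, then decode the rest as UTF-16LE.
    static String decodeLittleEndian(byte[] bytes) {
        boolean hasBom = bytes.length >= 2
                && bytes[0] == (byte) 0xFF
                && bytes[1] == (byte) 0xFE;
        int offset = hasBom ? 2 : 0;
        return new String(bytes, offset, bytes.length - offset,
                StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // "hi" in UTF-16LE, preceded by the little-endian BOM
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        System.out.println(decodeWithBom(data));       // prints "hi"
        System.out.println(decodeLittleEndian(data));  // prints "hi"
    }
}
```

The offset variant matters because decoding with "UTF-16LE" directly would keep a leading BOM as a U+FEFF character in the result.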
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units in the file or stream.
A byte order mark (BOM) is a sequence of bytes used to indicate the Unicode encoding of a text file. The underlying character code, U+FEFF, takes a different byte form depending on the character encoding; in UTF-8, for example, it is the bytes EF BB BF.
UTF-16LE: A character encoding that maps each code point of the Unicode character set to one or two 16-bit code units (2 or 4 bytes), stored in little-endian byte order. UTF-16LE stands for Unicode Transformation Format - 16-bit Little Endian.
UTF-16BE encoding is identical to the big-endian form of UTF-16 without a BOM. UTF-16LE encoding is identical to the little-endian form of UTF-16 without a BOM.
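To make the difference concrete, the sketch below prints the bytes Java produces for the single character "A" under each of the three standard charsets. The class name and the hex helper are hypothetical, written only for this example; it assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomDemo {
    // Hypothetical helper: encode text and render the bytes as spaced hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "UTF-16" writes a big-endian BOM, then big-endian code units.
        System.out.println(hex("A", StandardCharsets.UTF_16));   // FE FF 00 41
        // The LE and BE variants write no BOM at all.
        System.out.println(hex("A", StandardCharsets.UTF_16LE)); // 41 00
        System.out.println(hex("A", StandardCharsets.UTF_16BE)); // 00 41
    }
}
```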
The "UTF-16" charset name will always encode with a BOM (in Java, in big-endian order) and will decode data in either byte order, using the BOM to choose between big and little endian. "UnicodeBig" and "UnicodeLittle" are useful for encoding in a specific byte order. Use UTF-16LE or UTF-16BE for no BOM; see this post for how to use "\uFEFF" to handle BOMs manually. See here for the canonical charset names or (preferably) the Charset class. Also note that only a limited subset of encodings is absolutely required to be supported.
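The manual "\uFEFF" approach could look roughly like the sketch below: prepend U+FEFF to the text and encode with UTF-16LE, so the encoder itself emits the FF FE bytes. The class name is made up for this example, and it assumes Java 7 or later for StandardCharsets. Note that this replaces the byte array copy with a temporary string concatenation, so it is shorter rather than allocation-free.

```java
import java.nio.charset.StandardCharsets;

final class Utf16LeBomCodec {
    // Prepend U+FEFF so that the UTF-16LE encoder writes FF FE first.
    static byte[] encodeString(String message) {
        return ("\uFEFF" + message).getBytes(StandardCharsets.UTF_16LE);
    }

    // The "UTF-16" decoder reads the BOM and picks the byte order itself.
    static String decodeString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] encoded = encodeString("hi");
        // First two bytes are the little-endian BOM: FF FE
        System.out.printf("%02X %02X%n", encoded[0] & 0xFF, encoded[1] & 0xFF);
        System.out.println(decodeString(encoded)); // prints "hi"
    }
}
```

Using the StandardCharsets constants also removes the need for the try/catch around UnsupportedEncodingException, since the Charset overload of getBytes declares no checked exception.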