
How do I encode/decode UTF-16LE byte arrays with a BOM?


I need to encode/decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a Byte Order Marker (BOM), and I need to produce encoded byte arrays with a BOM.

Also, because I'm dealing with a Microsoft client/server, I'd like to emit the encoding in little endian (along with the LE BOM) to avoid any misunderstandings. I do realize that with the BOM it should work big endian, but I don't want to swim upstream in the Windows world.

As an example, here is a method which encodes a java.lang.String as UTF-16 in little endian with a BOM:

    public static byte[] encodeString(String message) {

        byte[] tmp = null;
        try {
            tmp = message.getBytes("UTF-16LE");
        } catch (UnsupportedEncodingException e) {
            // should not be possible
            AssertionError ae = new AssertionError("Could not encode UTF-16LE");
            ae.initCause(e);
            throw ae;
        }

        // use brute force method to add BOM
        byte[] utf16lemessage = new byte[2 + tmp.length];
        utf16lemessage[0] = (byte) 0xFF;
        utf16lemessage[1] = (byte) 0xFE;
        System.arraycopy(tmp, 0, utf16lemessage, 2, tmp.length);
        return utf16lemessage;
    }

What is the best way to do this in Java? Ideally I'd like to avoid copying the entire byte array into a new byte array that has two extra bytes allocated at the beginning.

The same goes for decoding such a string, but that's much more straightforward by using the java.lang.String constructor:

    public String(byte[] bytes,
                  int offset,
                  int length,
                  String charsetName)
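A minimal sketch of that decoding path, offered as an illustration rather than as part of the original question's code. It assumes Java 7+ (for StandardCharsets) and uses the Charset overload of the same constructor, which avoids the checked UnsupportedEncodingException. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class Utf16LeDecodeSketch {

    /** Decodes UTF-16LE bytes, skipping a leading FF FE BOM if one is present. */
    public static String decodeUtf16Le(byte[] bytes) {
        int offset = 0;
        if (bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFF
                && (bytes[1] & 0xFF) == 0xFE) {
            offset = 2; // skip the little-endian BOM
        }
        // The offset/length constructor decodes in place, without copying the array first.
        return new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 'H', 0, 'i', 0};
        System.out.println(decodeUtf16Le(data)); // prints "Hi"
    }
}
```

Skipping the BOM by offset matters here because Java's UTF-16LE decoder does not strip it; decoding from offset 0 would leave U+FEFF as the first character of the string.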
Jared Oberhaus asked May 18 '09 19:05



People also ask

What is UTF-16LE BOM?

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units in the file or stream.

What does BOM mean in encoding?

A byte order mark (BOM) is a sequence of bytes used to indicate the Unicode encoding of a text file. The underlying character code, U+FEFF, takes one of the following forms depending on the character encoding: EF BB BF in UTF-8, FE FF in UTF-16 big endian, and FF FE in UTF-16 little endian.

What is UTF-16LE encoding?

UTF-16LE: A character encoding that maps each code point of the Unicode character set to one or two 16-bit code units (2 or 4 bytes), with each code unit stored least significant byte first. UTF-16LE stands for Unicode Transformation Format - 16-bit Little Endian.

What is the difference between UTF-16LE and UTF 16BE?

UTF-16BE is identical to the big-endian form of UTF-16 written without a BOM. UTF-16LE is identical to the little-endian form of UTF-16 written without a BOM.
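To make the difference concrete, here is a small sketch that prints the bytes Java produces for the letter "A" under each of the three standard charsets. The class name and the hex helper are hypothetical; the output reflects Java's behavior, where the "UTF-16" encoder writes a big-endian BOM.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BytesSketch {

    /** Formats bytes as space-separated lowercase hex pairs. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // fe ff 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
    }
}
```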


1 Answer

The "UTF-16" charset name will always encode with a BOM and will decode data in either byte order by reading the BOM, but the legacy names "UnicodeBig" and "UnicodeLittle" are useful for encoding with a BOM in a specific byte order. Use UTF-16LE or UTF-16BE for no BOM - see this post for how to use "\uFEFF" to handle BOMs manually. See here for canonical naming of charset string names or (preferably) use the Charset class. Also note that only a limited subset of encodings is guaranteed to be supported on every Java platform.
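As a rough sketch of the manual "\uFEFF" approach (the class and method names here are hypothetical, and StandardCharsets assumes Java 7+): prepending U+FEFF to the string before encoding as UTF-16LE yields the FF FE BOM without a hand-written array copy, and decoding with "UTF-16" reads the BOM and strips it.

```java
import java.nio.charset.StandardCharsets;

public class Utf16LeBomSketch {

    /** Encodes as UTF-16LE with a leading FF FE BOM by prepending U+FEFF. */
    public static byte[] encodeWithBom(String message) {
        return ("\uFEFF" + message).getBytes(StandardCharsets.UTF_16LE);
    }

    /** Decodes BOM-prefixed UTF-16; the "UTF-16" decoder picks the byte order from the BOM. */
    public static String decodeWithBom(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] encoded = encodeWithBom("Hi");
        // encoded is FF FE 48 00 69 00
        System.out.println(encoded.length);         // prints 6
        System.out.println(decodeWithBom(encoded)); // prints "Hi"
    }
}
```

This trades the explicit System.arraycopy for a temporary string concatenation, so it is shorter rather than strictly copy-free. Note that the "UTF-16" decoder assumes big endian when no BOM is present, so decodeWithBom is only suitable for input that is known to carry one.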

McDowell answered Sep 30 '22 18:09
