How do I encode/decode UTF-16LE byte arrays with a BOM?

I need to encode/decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a Byte Order Mark (BOM), and I need to produce encoded byte arrays with a BOM.

Also, because I'm dealing with a Microsoft client/server, I'd like to emit the encoding in little endian (along with the LE BOM) to avoid any misunderstandings. I do realize that with the BOM, big endian should work too, but I don't want to swim upstream in the Windows world.

As an example, here is a method which encodes a java.lang.String as UTF-16 in little endian with a BOM:

    public static byte[] encodeString(String message) {

        byte[] tmp = null;
        try {
            tmp = message.getBytes("UTF-16LE");
        } catch (UnsupportedEncodingException e) {
            // should not be possible
            AssertionError ae =
                new AssertionError("Could not encode UTF-16LE");
            ae.initCause(e);
            throw ae;
        }

        // use brute force method to add BOM
        byte[] utf16lemessage = new byte[2 + tmp.length];
        utf16lemessage[0] = (byte) 0xFF;
        utf16lemessage[1] = (byte) 0xFE;
        System.arraycopy(tmp, 0,
                         utf16lemessage, 2,
                         tmp.length);
        return utf16lemessage;
    }

What is the best way to do this in Java? Ideally I'd like to avoid copying the entire byte array into a new byte array that has two extra bytes allocated at the beginning.

The same goes for decoding such a string, but that's much more straightforward by using the java.lang.String constructor:

    public String(byte[] bytes,
                  int offset,
                  int length,
                  String charsetName)
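
For illustration, here is a minimal sketch of that decoding path. It assumes the input is UTF-16LE with an optional leading FF FE BOM; the class and method names are hypothetical, and it uses the Charset overload of the constructor rather than the charsetName one shown above so there is no checked exception to handle.

```java
import java.nio.charset.StandardCharsets;

final class Utf16LeDecodeSketch {

    // Decodes UTF-16LE bytes, skipping a leading FF FE BOM if one is present.
    static String decodeString(byte[] bytes) {
        boolean hasBom = bytes.length >= 2
                && bytes[0] == (byte) 0xFF
                && bytes[1] == (byte) 0xFE;
        int offset = hasBom ? 2 : 0;
        // The offset/length constructor decodes in place, so the BOM is
        // skipped without copying the array first.
        return new String(bytes, offset, bytes.length - offset,
                          StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 'H', 0, 'i', 0};
        System.out.println(decodeString(data)); // prints "Hi"
    }
}
```

Skipping the BOM by offset matters here because decoding with UTF-16LE directly would otherwise leave U+FEFF as the first character of the result.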

asked May 18 '09 19:05

Jared Oberhaus

1 Answer

The "UTF-16" charset name will always encode with a BOM and will decode data in either byte order, using the BOM to tell big endian from little endian. "UnicodeBig" and "UnicodeLittle" are useful for encoding with a BOM in a specific byte order. Use "UTF-16LE" or "UTF-16BE" for no BOM - see this post for how to use "\uFEFF" to handle BOMs manually. See here for the canonical charset names or (preferably) use the Charset class. Also note that only a limited subset of encodings is absolutely required to be supported.
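
As a rough sketch of the manual "\uFEFF" approach mentioned above (the class and method names are hypothetical, and this is one possible way to do it rather than the only one): prepending U+FEFF to the string before encoding as UTF-16LE yields the FF FE BOM, and decoding with the "UTF-16" charset consumes whichever BOM is present.

```java
import java.nio.charset.StandardCharsets;

final class Utf16LeBomSketch {

    // Encodes as little-endian UTF-16 with a leading BOM (FF FE).
    // U+FEFF encoded as UTF-16LE is exactly the two bytes FF FE.
    static byte[] encodeWithBom(String message) {
        return ("\uFEFF" + message).getBytes(StandardCharsets.UTF_16LE);
    }

    // Decodes UTF-16 data in either byte order; the "UTF-16" charset reads
    // the BOM to choose the byte order and does not include it in the result.
    static String decodeAnyUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] encoded = encodeWithBom("Hi");
        System.out.println(encoded.length);          // prints 6
        System.out.println(decodeAnyUtf16(encoded)); // prints "Hi"
    }
}
```

Note that the string concatenation still allocates a temporary string, so this avoids the explicit System.arraycopy step rather than every extra allocation. It relies only on UTF-16LE and UTF-16, both of which every Java platform is required to support.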

answered Sep 30 '22 18:09

McDowell
