XML encoding declaration and endianness

Q: What is encoding in XML declaration?

Encoding is the process of converting Unicode characters into their equivalent binary representation. When an XML processor reads an XML document, it decodes the bytes according to the document's encoding. Hence, the encoding should be specified in the XML declaration.

Q: Is UTF-8 big or little endian?

UTF-8 uses one to four bytes per character, depending on the code point. Its code units are single bytes, so it has no big-endian or little-endian variants.
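As a small illustration of UTF-8 being a variable-length, byte-oriented encoding, the following Java sketch counts the bytes needed for four sample characters. The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {

    /** Returns how many bytes UTF-8 uses to encode the given string. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // U+0041, ASCII letter
        System.out.println(utf8Length("\u00E9"));       // U+00E9, e with acute accent
        System.out.println(utf8Length("\u20AC"));       // U+20AC, euro sign
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, outside the BMP
    }
}
```

This prints 1, 2, 3 and 4. Because each code unit is a single byte, there is no byte order to choose.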

Q: Is UTF-16 Little endian?

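UTF-16 uses code units that are two bytes long. There are three UTF-16 sub-flavors: UTF-16BE uses big-endian byte serialization (most significant byte first); UTF-16LE uses little-endian byte serialization (least significant byte first); plain UTF-16 may use either byte order, normally signaled by a byte-order mark (BOM) at the start of the text.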
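To make the byte orders concrete, here is a short Java sketch that encodes the letter A with each of the JDK's UTF-16 charsets and prints the bytes as hex. The class and helper names are hypothetical. The BOM written for plain "UTF-16" reflects how Java's encoder is documented to behave, not a requirement on other tools.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16ByteOrder {

    /** Encodes the text in the given charset and formats the bytes as hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A", StandardCharsets.UTF_16BE)); // 00 41
        System.out.println(hex("A", StandardCharsets.UTF_16LE)); // 41 00
        // Java's "UTF-16" encoder writes a big-endian BOM, then big-endian data.
        System.out.println(hex("A", StandardCharsets.UTF_16));   // FE FF 00 41
    }
}
```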

Q: What encoding does XML use?

An XML document can name its encoding in the XML declaration, for example <?xml version="1.0" encoding="ISO-8859-1"?>. Without this information, the default encoding is UTF-8 or UTF-16, depending on the presence of a Unicode byte-order mark (BOM) at the beginning of the XML file.
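As a rough sketch of how the declared encoding could be pulled out of an XML declaration that has already been decoded to text, the following Java snippet uses a regular expression. The class name, method name, and pattern are assumptions made for illustration; the pattern expects a version followed by an encoding and is not a full implementation of the XML grammar.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlDeclEncoding {

    // Group 3 captures the encoding name; groups 1 and 2 capture the quote
    // characters so that each value is closed by the quote that opened it.
    private static final Pattern DECL = Pattern.compile(
            "^<\\?xml\\s+version\\s*=\\s*([\"'])[^\"']*\\1"
            + "\\s+encoding\\s*=\\s*([\"'])([A-Za-z][A-Za-z0-9._-]*)\\2");

    /** Returns the declared encoding name, or empty if none is declared. */
    static Optional<String> declaredEncoding(String head) {
        Matcher m = DECL.matcher(head);
        return m.find() ? Optional.of(m.group(3)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(declaredEncoding(
                "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>"));
        System.out.println(declaredEncoding("<?xml version=\"1.0\"?><root/>"));
    }
}
```

When the method returns empty, a caller would fall back to the UTF-8 or UTF-16 default.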

Tags: java, parsing, character-encoding, xml

I'm tidying up some of my really old Java code, written to the first edition of the XML spec before XML parsing was included in the JDK libraries, and trying to bring it up to date as well as write some tests. In particular I'm (re)implementing XML character encoding autodetection like this:

I read the BOM, if any.
If there is no BOM, I "impute" a BOM based upon the expected <?xml start of the XML declaration.
I now have enough information (number of bytes per character, endianness, etc.) to read my way over to the encoding= declaration, if any, which according to the XML spec may tell me some more specific or esoteric encoding.
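The first two steps could be sketched roughly as follows. This is a minimal illustration and not the asker's actual code: the class, enum, and method names are invented, and it covers only the UTF-8 and UTF-16 byte patterns, leaving out the UTF-32 and EBCDIC cases that a complete detector would also need to check.

```java
public class XmlEncodingSniffer {

    /** Hypothetical result type: what the first bytes reveal about the file. */
    enum Family {
        UTF8_BOM, UTF16_BE_BOM, UTF16_LE_BOM,
        UTF16_BE_NO_BOM, UTF16_LE_NO_BOM, ASCII_COMPATIBLE, UNKNOWN
    }

    /** True if data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    static Family sniff(byte[] head) {
        // Step 1: look for a real BOM.
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return Family.UTF8_BOM;
        if (startsWith(head, 0xFE, 0xFF)) return Family.UTF16_BE_BOM;
        if (startsWith(head, 0xFF, 0xFE)) return Family.UTF16_LE_BOM;
        // Step 2: no BOM, so infer the layout from the bytes of "<?xm" or "<?".
        if (startsWith(head, 0x00, 0x3C, 0x00, 0x3F)) return Family.UTF16_BE_NO_BOM;
        if (startsWith(head, 0x3C, 0x00, 0x3F, 0x00)) return Family.UTF16_LE_NO_BOM;
        if (startsWith(head, 0x3C, 0x3F, 0x78, 0x6D)) return Family.ASCII_COMPATIBLE;
        return Family.UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] bomLe = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00, 0x3F, 0x00};
        System.out.println(sniff(bomLe));                   // UTF16_LE_BOM
        System.out.println(sniff("<?xml ver".getBytes(
                java.nio.charset.StandardCharsets.US_ASCII))); // ASCII_COMPATIBLE
    }
}
```

The result only fixes the code unit size and byte order well enough to read the declaration; the encoding= value, if present, still has to be read and reconciled with it in the third step.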

So let's say that the file has an actual BOM for UTF-16LE. What should be the value of the XML encoding attribute? Should it be encoding="UTF-16LE"? But the Unicode Byte Order Mark FAQ seems to indicate that, if a UTF-16 family BOM is present, I should "tag the text" as merely UTF-16. Does that mean I should use encoding="UTF-16" in my XML file? But then should my parser ignore the encoding value and go with the more specific charset it has determined from the BOM? I'm starting to confuse myself.

The W3C HTML BOM FAQ seems to indicate that tagging the text refers to "labelled in HTTP", that is, an external charset designation, presumably the charset parameter of the HTTP Content-Type header. So perhaps it would be OK to have an XML file starting with a BOM yet containing an XML declaration of UTF-16LE or UTF-16BE. But I have yet to see such an XML file.

If I use a UTF-16LE BOM with an XML file, 1) what value should I use in the encoding attribute, and 2) what charset should my parser autodetect as the encoding of the file?


asked Aug 25 '14 01:08

Garret Wilson

1 Answer

The key to understanding this is to realize that the UTF-16 encoding scheme is distinct from UTF-16LE and UTF-16BE. UTF-16, little endian, is NOT UTF-16LE.

Note especially point 4 in the last question in the Unicode BOM FAQ. If the encoding is UTF-16BE or UTF-16LE, a BOM MUST NOT be used. You may also refer to 3.10 in the Unicode standard for a formal definition of these "encoding schemes".

So, if you find a BOM for UTF-16, the encoding is UTF-16, NOT UTF-16LE or UTF-16BE (neither of which is allowed to have a BOM). If there is no BOM, the encoding may be any of the three, though in that case, UTF-16 becomes basically indistinguishable from the BE and LE variants. However, note that section 4.3.3 of XML 1.1 says "Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark". So in the case of XML, if there is no BOM, then the encoding cannot be UTF-16 (but it may be UTF-16BE or UTF-16LE).
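Java's own charsets happen to follow the same distinction, which a small sketch can show. The class and method names below are hypothetical. The sketch decodes the same four bytes, a little-endian BOM followed by a less-than sign, once as "UTF-16" and once as "UTF-16LE".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomDecodingDemo {

    // FF FE is the little-endian BOM; 3C 00 is '<' in little-endian order.
    static final byte[] BOM_THEN_LT = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};

    static String decode(Charset charset) {
        return new String(BOM_THEN_LT, charset);
    }

    public static void main(String[] args) {
        // "UTF-16" uses the BOM to pick the byte order and then drops it.
        System.out.println(decode(StandardCharsets.UTF_16).length());   // 1
        // "UTF-16LE" treats the same two bytes as an ordinary U+FEFF character.
        System.out.println(decode(StandardCharsets.UTF_16LE).length()); // 2
    }
}
```

So for a file that starts with FF FE, decoding as "UTF-16" yields clean text, whereas "UTF-16LE" leaves a stray U+FEFF in front of the XML declaration.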


answered Oct 24 '22 11:10

Kevin
