XML encoding declaration and endianness

I'm tidying up some of my really old Java code, written to the first edition of the XML spec before XML parsing was included in the JDK libraries, and trying to bring it up to date as well as write some tests. In particular I'm (re)implementing XML character encoding autodetection like this:

  1. I read the BOM, if any.
  2. If there is no BOM, I "impute" a BOM based upon the expected <?xml start of the XML declaration.
  3. I now have enough information (number of bytes per character, endianness, etc.) to read my way over to the encoding= declaration, if any, which according to the XML spec may name a more specific or esoteric encoding.
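Steps 1 and 2 above can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the class, enum, and method names are hypothetical, the byte patterns follow the non-normative autodetection appendix of the XML spec, and the EBCDIC pattern is left out for brevity. The result is deliberately a byte layout rather than an encoding name, since the name is what the rest of this question is about.

```java
final class XmlEncodingSniffer {

    /** Byte layout inferred from the first bytes; not yet a final encoding name. */
    enum Layout { EIGHT_BIT, SIXTEEN_BIT_BE, SIXTEEN_BIT_LE, THIRTY_TWO_BIT_BE, THIRTY_TWO_BIT_LE, UNKNOWN }

    /** Returns true if data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Step 1: look for a real BOM. */
    static Layout fromBom(byte[] data) {
        // Four-byte BOMs are tested first: FF FE 00 00 also begins with the two-byte FF FE.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return Layout.THIRTY_TWO_BIT_BE;
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Layout.THIRTY_TWO_BIT_LE;
        if (startsWith(data, 0xFE, 0xFF)) return Layout.SIXTEEN_BIT_BE;
        if (startsWith(data, 0xFF, 0xFE)) return Layout.SIXTEEN_BIT_LE;
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return Layout.EIGHT_BIT;
        return Layout.UNKNOWN;
    }

    /** Step 2: no BOM, so "impute" the layout from how "<?xml" would be serialized. */
    static Layout imputed(byte[] data) {
        if (startsWith(data, 0x00, 0x00, 0x00, 0x3C)) return Layout.THIRTY_TWO_BIT_BE;
        if (startsWith(data, 0x3C, 0x00, 0x00, 0x00)) return Layout.THIRTY_TWO_BIT_LE;
        if (startsWith(data, 0x00, 0x3C, 0x00, 0x3F)) return Layout.SIXTEEN_BIT_BE;
        if (startsWith(data, 0x3C, 0x00, 0x3F, 0x00)) return Layout.SIXTEEN_BIT_LE;
        if (startsWith(data, 0x3C, 0x3F, 0x78, 0x6D)) return Layout.EIGHT_BIT;
        return Layout.UNKNOWN;
    }

    /** Combines both steps: a real BOM wins, otherwise fall back to the imputed layout. */
    static Layout sniff(byte[] data) {
        Layout layout = fromBom(data);
        return layout != Layout.UNKNOWN ? layout : imputed(data);
    }
}
```

With the layout in hand, step 3 only needs to decode enough code units to reach the encoding= pseudo-attribute.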

So let's say that the file has an actual BOM for UTF-16LE. What should be the value of the XML encoding attribute? Should it be encoding="UTF-16LE"? But the Unicode Byte Order Mark FAQ seems to indicate that, if a UTF-16 family BOM is present, I should "tag the text" as merely UTF-16. Does that mean I should use encoding="UTF-16" in my XML file? But then should my parser ignore the encoding value and go with the more specific charset it has determined from the BOM? I'm starting to confuse myself.

The W3C HTML BOM FAQ seems to indicate that "tagging the text" refers to text "labelled in HTTP", that is, an external charset designation, presumably the charset parameter of the HTTP Content-Type header. So perhaps it would be OK to have an XML file starting with a BOM yet containing an XML declaration of UTF-16LE or UTF-16BE. But I have yet to see such an XML file.

If I use a UTF-16LE BOM with an XML file, 1) what value should I use in the encoding attribute, and 2) what charset should my parser autodetect as the encoding of the file?

Garret Wilson asked Aug 25 '14 01:08



1 Answer

The key to understanding this is that the UTF-16 encoding scheme is distinct from the UTF-16LE and UTF-16BE encoding schemes. UTF-16 text that happens to be serialized in little-endian byte order is NOT UTF-16LE.

Note especially point 4 in the last question in the Unicode BOM FAQ. If the encoding is UTF-16BE or UTF-16LE, a BOM MUST NOT be used. You may also refer to 3.10 in the Unicode standard for a formal definition of these "encoding schemes".
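Since the question is about Java, the JDK's built-in charsets happen to illustrate the same distinction. The following is a small sketch (the class and method names are made up for the example) that decodes the bytes FF FE 3C 00 twice: StandardCharsets.UTF_16 consumes the BOM and uses it to pick the byte order, while StandardCharsets.UTF_16LE treats the same two bytes as an ordinary U+FEFF character.

```java
import java.nio.charset.StandardCharsets;

final class Utf16SchemeDemo {

    /** A little-endian BOM (FF FE) followed by "<" (3C 00). */
    static final byte[] BOM_THEN_LT = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};

    /** Decodes with the "UTF-16" charset: the BOM selects the byte order and is dropped. */
    static String decodeAsUtf16() {
        return new String(BOM_THEN_LT, StandardCharsets.UTF_16);
    }

    /** Decodes with "UTF-16LE": no BOM is expected, so FF FE becomes the character U+FEFF. */
    static String decodeAsUtf16Le() {
        return new String(BOM_THEN_LT, StandardCharsets.UTF_16LE);
    }
}
```

Here decodeAsUtf16() yields the one-character string "<", whereas decodeAsUtf16Le() yields two characters, U+FEFF followed by "<". A parser that labels BOM-prefixed text as UTF-16LE and decodes it that way would therefore need to skip the BOM itself.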

So, if you find a UTF-16 BOM, the encoding is UTF-16, NOT UTF-16LE or UTF-16BE (neither of which is allowed to have a BOM). If there is no BOM, the encoding may be any of the three, though in that case UTF-16 becomes basically indistinguishable from the BE and LE variants. However, note that section 4.3.3 of XML 1.1 says "Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark". So in the case of XML, if there is no BOM, the encoding cannot be UTF-16 (but it may be UTF-16BE or UTF-16LE). To answer the two questions directly: with a little-endian BOM (FF FE), declare encoding="UTF-16", and have the parser report UTF-16 read in little-endian byte order.
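The rule in this answer can be written down as a tiny consistency check. This is a sketch of the strict reading given here, not of what any particular XML parser does (real parsers may be more lenient), and the class and method names are hypothetical. Encoding names are compared case-insensitively, as the XML spec recommends.

```java
import java.util.Locale;

final class Utf16LabelRules {

    /** The encoding label implied by the byte stream alone, under this answer's reading. */
    static String expectedLabel(boolean hasBom, boolean littleEndian) {
        if (hasBom) {
            return "UTF-16"; // a BOM means the UTF-16 scheme, whatever the byte order
        }
        return littleEndian ? "UTF-16LE" : "UTF-16BE"; // XML rules out BOM-less "UTF-16"
    }

    /** True if the declared encoding= value agrees with what the bytes imply. */
    static boolean declarationMatches(String declared, boolean hasBom, boolean littleEndian) {
        return expectedLabel(hasBom, littleEndian).equals(declared.toUpperCase(Locale.ROOT));
    }
}
```

For the file in the question (little-endian BOM present), declarationMatches("UTF-16", true, true) holds, while declarationMatches("UTF-16LE", true, true) does not.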

Kevin answered Oct 24 '22 11:10