I'm tidying up some of my really old Java code, written to the first edition of the XML spec before XML parsing was included in the JDK libraries, and trying to bring it up to date as well as write some tests. In particular I'm (re)implementing XML character encoding autodetection like this:
1. Look for the <?xml start of the XML declaration.
2. Read the encoding= declaration, if any, which according to the XML spec may tell me some more specific or esoteric encoding.
So let's say that the file has an actual BOM for UTF-16LE. What should be the value of the XML encoding attribute? Should it be encoding="UTF-16LE"? But the Unicode Byte Order Mark FAQ seems to indicate that, if a UTF-16 family BOM is present, I should "tag the text" as merely UTF-16. Does that mean I should use encoding="UTF-16" in my XML file? But then should my parser ignore the encoding value and go with the more specific charset it has determined from the BOM? I'm starting to confuse myself.
The W3C HTML BOM FAQ seems to indicate that tagging the text refers to "labelled in HTTP", that is, an external charset designation, presumably the charset parameter of the HTTP Content-Type header. So perhaps it would be OK to have an XML file starting with a BOM yet containing an XML declaration of UTF-16LE or UTF-16BE. But I have yet to see such an XML file.
If I use a UTF-16LE BOM with an XML file, 1) what value should I use in the encoding attribute, and 2) what charset should my parser autodetect as the encoding of the file?
Encoding is the process of converting Unicode characters into their equivalent binary representation. When the XML processor reads an XML document, it decodes the bytes according to the document's encoding. Hence, we need to specify the encoding in the XML declaration.
UTF-8 uses one to four bytes per character (the BOM character U+FEFF, for example, takes three bytes: EF BB BF), and it has no big-endian or little-endian variants.
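To make the byte counts concrete, here is a minimal Java sketch that measures how many bytes a few characters occupy in UTF-8. The class and method names (Utf8LengthDemo, utf8Length) are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named only for this illustration.
final class Utf8LengthDemo {
    /** Number of bytes the given text occupies once encoded as UTF-8. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // U+0041: 1 byte
        System.out.println(utf8Length("\u00E9"));       // U+00E9: 2 bytes
        System.out.println(utf8Length("\u20AC"));       // U+20AC: 3 bytes
        System.out.println(utf8Length("\uFEFF"));       // U+FEFF (BOM character): 3 bytes
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600 (surrogate pair): 4 bytes
    }
}
```

Because UTF-8 is defined as a byte sequence rather than a sequence of multi-byte units, there is no byte order to choose, which is why a UTF-8 BOM only acts as a signature.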
UTF-16 uses code units that are two bytes long. There are three UTF-16 sub-flavors: UTF-16BE uses big-endian byte serialization (most significant byte first); UTF-16LE uses little-endian byte serialization (least significant byte first); plain UTF-16 may use either byte order and signals which one with a BOM.
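The difference in byte serialization can be seen by encoding a single character with Java's three standard UTF-16 charsets. This is only a sketch; Utf16ByteOrderDemo and hex are names invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named only for this illustration.
final class Utf16ByteOrderDemo {
    /** Formats bytes as space-separated upper-case hex, e.g. "FE FF 00 41". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "A"; // U+0041
        // Big-endian: most significant byte first, no BOM written.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        // Little-endian: least significant byte first, no BOM written.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // Plain UTF-16: Java's encoder writes a big-endian BOM, then big-endian data.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
    }
}
```

Note that writing a big-endian BOM for plain UTF-16 is what Java's encoder happens to do; other tools may write FF FE followed by little-endian data and still be producing "UTF-16".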
An XML encoding declaration looks like this: <?xml version="1.0" encoding="ISO-8859-1"?> Without this information, the default encoding is UTF-8 or UTF-16, depending on the presence of a Unicode byte-order mark (BOM) at the beginning of the XML file.
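A rough Java sketch of that default rule follows. It checks only the UTF-8 and UTF-16 BOMs; UTF-32 BOMs and the BOM-less byte patterns described in the XML spec's autodetection appendix are deliberately left out, and the names BomSniffer, sniff and startsWith are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named only for this illustration.
final class BomSniffer {
    /**
     * Returns the charset implied by a leading BOM, or UTF-8 when no BOM is found.
     * Simplification: a UTF-32LE BOM (FF FE 00 00) would be misreported as UTF-16.
     */
    static Charset sniff(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        // Either byte order is labelled plain UTF-16; Java's UTF-16 decoder
        // reads the BOM itself to pick big- or little-endian.
        if (startsWith(head, 0xFE, 0xFF) || startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16;
        }
        return StandardCharsets.UTF_8; // no BOM: fall back to the UTF-8 default
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        System.out.println(sniff(utf16le));
        System.out.println(sniff("<?xml".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

A real parser would still go on to read the encoding declaration, since the BOM only settles the default case where no other information is available.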
The key to understanding this is to realize that the UTF-16 encoding scheme is distinct from UTF-16LE and UTF-16BE. UTF-16, little endian, is NOT UTF-16LE.
Note especially point 4 in the last question in the Unicode BOM FAQ. If the encoding is UTF-16BE or UTF-16LE, a BOM MUST NOT be used. You may also refer to 3.10 in the Unicode standard for a formal definition of these "encoding schemes".
So, if you find a BOM for UTF-16, the encoding is UTF-16, NOT UTF-16LE or UTF-16BE (neither of which is allowed to have a BOM). If there is no BOM, the encoding may be any of the three, though in that case, UTF-16 becomes basically indistinguishable from the BE and LE variants. However, note that 4.3.3 of XML 1.1 says "Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark". So in the case of XML, if there is no BOM, then the encoding cannot be UTF-16 (but it may be UTF-16BE or UTF-16LE).
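Java's standard charsets happen to follow the same distinction, which may help when testing a parser. In the sketch below (BomDecodingDemo and its members are names invented for the example), the same BOM-prefixed little-endian bytes are decoded once as UTF-16 and once as UTF-16LE.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named only for this illustration.
final class BomDecodingDemo {
    /** The text "<a/>" in little-endian UTF-16, preceded by the FF FE byte-order mark. */
    static final byte[] BOM_LE_DOC = {
        (byte) 0xFF, (byte) 0xFE, 0x3C, 0x00, 0x61, 0x00, 0x2F, 0x00, 0x3E, 0x00
    };

    /** Decodes as plain UTF-16: the decoder consumes the BOM and picks the byte order from it. */
    static String decodeAsUtf16() {
        return new String(BOM_LE_DOC, StandardCharsets.UTF_16);
    }

    /** Decodes as UTF-16LE: the BOM bytes are kept as a leading U+FEFF character. */
    static String decodeAsUtf16Le() {
        return new String(BOM_LE_DOC, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        System.out.println(decodeAsUtf16().length());               // 4
        System.out.println(decodeAsUtf16Le().length());             // 5
        System.out.println((int) decodeAsUtf16Le().charAt(0));      // 65279, i.e. U+FEFF
    }
}
```

So a parser that has seen a UTF-16 BOM can hand the whole stream, BOM included, to a UTF-16 decoder; if it instead chooses UTF-16LE or UTF-16BE, it has to skip the BOM bytes itself first.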