
XML Spec and UTF-16

Section 4.3.3 and Appendix F of the XML 1.0 spec discuss UTF-16, the byte order mark (BOM) in UTF-16 encoded data streams, and the XML encoding declaration. From the information in those sections, it would seem that a byte order mark is required in UTF-16 documents. However, the summary chart in Appendix F includes a scenario in which a UTF-16 input has no byte order mark but does have an XML declaration. According to section 4.3.3, a UTF-16 encoded document does not require an encoding declaration (and the XML declaration itself is optional in such a case).

Given this information, is a UTF-16 XML document that has neither a BOM nor an XML declaration, and for which no encoding information is provided externally, considered well-formed if the rest of the document is?

Mike Menzel asked Dec 19 '13 21:12





1 Answer

From the Unicode 6.2 specification (page 99):

The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

So, as far as Unicode is concerned, a BOM is not required in a UTF-16 document. But there may be a "higher-level protocol", such as the XML specification, that says what needs to be done for UTF-16 XML documents without a BOM.
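To make the Unicode rule concrete, here is a minimal Python sketch of it. The function name `decode_utf16_scheme` is made up for this illustration, and the sketch deliberately ignores any higher-level protocol, so it shows only the Unicode default, not what XML requires.

```python
import codecs


def decode_utf16_scheme(data: bytes) -> str:
    """Decode bytes in the UTF-16 encoding scheme using the Unicode default.

    Hypothetical helper for illustration only; it models the quoted rule
    and nothing from the XML specification.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return data[2:].decode("utf-16-le")
    # No BOM and no higher-level protocol: the byte order is big-endian.
    return data.decode("utf-16-be")


with_bom = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
without_bom = "<a/>".encode("utf-16-be")

print(decode_utf16_scheme(with_bom))
print(decode_utf16_scheme(without_bom))
```

Both calls recover the same text, which is why Unicode on its own has no need for the BOM here. Whether XML accepts the second input is a separate matter.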

Section 4.3.3 in the XML 1.0 specification says:

Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF).

Let's come back to that later. Appendix F describes approaches for detecting the character encoding when a BOM isn't present. But I don't think that appendix is relevant to your question: you're asking whether a UTF-16 XML document without a BOM and without an XML declaration is "well-formed", and Appendix F is a non-normative part of the specification.
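For reference, the kind of detection Appendix F describes can be sketched roughly as follows. This is a partial, non-authoritative rendering of the summary chart: it omits the unusual UCS-4 byte orders, and the function name `sniff_xml_encoding` and the returned labels are invented for this example.

```python
# Byte signatures in the order they are tested. The 4-byte BOMs come first
# because FF FE 00 00 also starts with the 2-byte UTF-16 BOM FF FE.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UCS-4 big-endian (BOM)"),
    (b"\xff\xfe\x00\x00", "UCS-4 little-endian (BOM)"),
    (b"\xfe\xff", "UTF-16 big-endian (BOM)"),
    (b"\xff\xfe", "UTF-16 little-endian (BOM)"),
    (b"\xef\xbb\xbf", "UTF-8 (BOM)"),
    # No BOM: guess from how the characters "<?xm" would be encoded.
    (b"\x00\x00\x00<", "UCS-4 big-endian (no BOM)"),
    (b"<\x00\x00\x00", "UCS-4 little-endian (no BOM)"),
    (b"\x00<\x00?", "UTF-16 big-endian (no BOM)"),
    (b"<\x00?\x00", "UTF-16 little-endian (no BOM)"),
    (b"<?xm", "ASCII-compatible, read the encoding declaration"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC, read the encoding declaration"),
]


def sniff_xml_encoding(prefix: bytes) -> str:
    """Guess an encoding family from the first bytes of an XML entity."""
    for signature, label in SIGNATURES:
        if prefix.startswith(signature):
            return label
    # Anything else is treated as UTF-8 without an encoding declaration.
    return "UTF-8 (assumed)"


print(sniff_xml_encoding('<?xml version="1.0"?><a/>'.encode("utf-16-be")))
print(sniff_xml_encoding("<a/>".encode("utf-16-be")))
```

The BOM-less rows rely on the entity starting with `<?xml`. A UTF-16 document with neither a BOM nor an XML declaration begins with the bytes 00 3C followed by whatever encodes the element name, so it matches no row and falls through to the UTF-8 assumption.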

So, going back to the specification, a document is well-formed if "Taken as a whole, it matches the production labeled document." (Section 2.1). Reviewing the document production shows that the XML declaration is optional (this is also mentioned in Section 2.8). So it's possible to have a well-formed document without an XML declaration; this answers half of your question.

The other half is whether a UTF-16 XML document without an XML declaration and also without a BOM can still be well-formed. In Section 4.3.3 it says (emphasis mine):

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.

Based on this, in the absence of external information, a UTF-16 XML document without a BOM and without an encoding declaration (which is part of the XML declaration) is not a well-formed document, because a fatal error violates well-formedness (see the definition of well-formedness constraint in Section 1.2). This also matches what was said earlier in Section 4.3.3 about the requirement of a BOM for UTF-16.
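The rule quoted from Section 4.3.3 can be written down as a small predicate. This is only a sketch of my reading of that sentence: the function `is_fatal_encoding_error` and its parameters are hypothetical, encoding names are compared as plain case-insensitive strings (so the caller must pass them in the same form), and the case where a transport protocol supplies the encoding is simply left out of scope.

```python
from typing import Optional


def is_fatal_encoding_error(
    actual_encoding: str,
    has_bom: bool,
    declared_encoding: Optional[str] = None,
    external_encoding: Optional[str] = None,
) -> bool:
    """Apply the quoted Section 4.3.3 sentence to one entity (illustrative only)."""
    if external_encoding is not None:
        # The quoted sentence only covers the absence of external information.
        return False
    actual = actual_encoding.lower()
    if declared_encoding is not None:
        # An encoding declaration must name the encoding actually used.
        return actual != declared_encoding.lower()
    if not has_bom:
        # Neither a BOM nor an encoding declaration: only UTF-8 is allowed.
        return actual != "utf-8"
    return False


# The case from the question: UTF-16, no BOM, no declaration, nothing external.
print(is_fatal_encoding_error("utf-16", has_bom=False))  # True
# The same document with a BOM is not caught by this rule.
print(is_fatal_encoding_error("utf-16", has_bom=True))   # False
```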

Simeon Visser answered Oct 21 '22 03:10
