I know that the default encoding of XML is UTF-8, and that all XML consumers MUST support it, and so on. So this is not merely a question of whether XML has a default encoding.
I also know that the XML declaration <?xml version="1.0" ... ?>
at the beginning of the document is itself optional, and that specifying the encoding within it is optional as well.
So I ask myself whether the following two XML declarations are two expressions of exactly the same thing:
<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>
From my current understanding I would say they are equivalent, but I do not know for sure. Has the equivalence of these two declarations been specified somewhere?
(Consider each of these example lines to be the very first line of an XML document, not preceded by any bytes, with the document encoded as UTF-8.)
If no encoding declaration exists in a document's XML declaration, that XML document is required to use either UTF-8 or UTF-16 encoding.
In isolation, both are equivalent. You have already cited the relevant parts of the specifications which show that both declarations are equivalent.
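A quick sanity check with a real parser (Python's ElementTree here, as one arbitrary choice; any conforming parser should behave the same) shows that the two declarations yield identical results for a UTF-8 document:

```python
import xml.etree.ElementTree as ET

# Two documents that are byte-identical except for the explicit
# encoding declaration. \xc3\xa9 is the UTF-8 encoding of "é".
doc_without = b'<?xml version="1.0"?><root>\xc3\xa9</root>'
doc_with = b'<?xml version="1.0" encoding="UTF-8"?><root>\xc3\xa9</root>'

# Both decode as UTF-8: with no declaration, the parser falls back to
# UTF-8/UTF-16 auto-detection, which yields the same result here.
print(ET.fromstring(doc_without).text)  # é
print(ET.fromstring(doc_with).text)     # é
```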
However, XML can have an envelope, such as the HTTP Content-Type header. The W3C specifies that this envelope information takes priority over any declarations in the file itself. So, for example, if you retrieve XML via HTTP, you could potentially get this:
HTTP/1.1 200 OK
Content-Type: text/xml
<root/>
In this case, the XML should be read as ASCII, because the default charset for text/* MIME types is ASCII. This is why you should use application/xml MIME types: these default to UTF-8. The "application" prefix means that the relevant application's specifications define things like the default encoding (i.e. the XML spec takes over). With text/* MIME types, the default is ASCII, and a charset parameter must be included in the MIME type to change it.
Here's another case:
HTTP/1.1 200 OK
Content-Type: text/xml; charset=win-1252
<?xml version="1.0" encoding="utf-8"?>
<root/>
In this case, a conforming XML processor should read the file as win-1252, not utf-8.
Another case:
HTTP/1.1 200 OK
Content-Type: application/xml
<?xml version="1.0" encoding="win-1252"?>
<root/>
Here the encoding is win-1252.
HTTP/1.1 200 OK
Content-Type: application/xml; charset=ascii
<?xml version="1.0" encoding="win-1252"?>
<root/>
Here the encoding is ascii.
The Short Answer
Under the very specific circumstances of a UTF-8 encoded document with no external encoding information (which I understand from the comments is what you're interested in), there is no difference between the two declarations.
The long answer is far more interesting though.
What The Spec Says
Appendix F.1 of the XML specification explains the process that should be followed to determine the encoding when there is no external encoding information.
If the document is encoded as one of the UTF variants, the parser should be able to detect the encoding within the first 4 bytes, either from the Byte Order Mark, or the start of the XML declaration.
However, according to the spec, it should still read the encoding declaration.
In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the entity.
If they don't match, according to section 4.3.3:
...it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration
Encoded UTF-16, Declared UTF-8
Let's see what happens in reality when we create an XML document encoded as UTF-16 but with the encoding declaration set to UTF-8.
Opera, Firefox and Chrome all interpret the document as UTF-16, ignoring the encoding declaration. Internet Explorer (version 9, at least) displays a blank document, but no actual error.
So if you include a UTF-8 encoding declaration on your UTF-8 document and someone at a later stage converts it to UTF-16, it'll work in most browsers, but fail in IE (and, I assume, most Microsoft XML APIs). If you had left the encoding declaration off, you would have been fine.
Technically, I think IE is the most accurate. The fact that it doesn't display an error as such might be explained by the error occurring at the encoding level rather than the XML level. Presumably it does its best to interpret the UTF-16 bytes as UTF-8, fails to find any characters that decode, and ends up passing an empty character sequence to the XML parser.
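Non-browser parsers tend to side with IE here, except that they surface the error. For example, Python's expat-based ElementTree (my test; other parsers may differ) rejects a UTF-16 document that declares utf-8 with exactly the kind of fatal error the spec describes:

```python
import xml.etree.ElementTree as ET

# A document whose bytes are UTF-16LE (with BOM) but whose declaration lies.
doc = b'\xff\xfe' + '<?xml version="1.0" encoding="utf-8"?><root/>'.encode('utf-16-le')

try:
    ET.fromstring(doc)
except ET.ParseError as e:
    # expat reports the mismatch between the detected and declared encodings
    print('fatal error:', e)
```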
Encoded UTF-8, Declared Otherwise
You might now think that Firefox, Chrome and Opera are just ignoring the encoding declaration altogether, but that's not always the case.
If you encode a document as UTF-8 (with a byte order mark, so it's unmistakable as anything else) but set the encoding declaration to Latin1, all of the browsers will happily decode the content as Latin1, ignoring the UTF-8 BOM.
Again this seems right to me. The fact that the BOM characters aren't valid in Latin1 just means they are silently dropped at the character decoding level.
This doesn't work for all declared encodings on a UTF-8 document though. If the declared encoding is UTF-16, we're back with Opera, Firefox and Chrome ignoring the declared encoding, while Internet Explorer returns a blank document.
Essentially, anything that makes IE return a blank document is going to make other browsers ignore the declared encoding.
Other Inconsistencies
It's also worth mentioning the importance of the Byte Order Mark. According to section 4.3.3 of the spec:
Entities encoded in UTF-16 MUST [...] begin with the Byte Order Mark
However, if you try to read a UTF-16 encoded XML document without a BOM, most browsers will nevertheless accept it as valid. Only Firefox reports it as an XML parsing error.
External Encoding Information
Up to now, we've been considering what happens when there is no external encoding information, but, as others have mentioned, if the document is received via HTTP or enclosed in a MIME envelope of some sort, the encoding information from those sources should take preference over the document encoding.
Most of the details for the various XML MIME types are described in RFC 3023. However, the reality is somewhat different from what is specified.
First of all, text/xml with an omitted charset parameter should be treated as US-ASCII, but that requirement has almost always been ignored. Browsers will typically use the value of the XML encoding declaration, or default to UTF-8 if there is none.
Second, if there is a UTF-8 BOM on the document, and the XML encoding declaration is either UTF-8 or not included, the document will be interpreted as UTF-8, regardless of the charset used in the Content-Type.
The only time the encoding from the Content-Type seems to take precedence is when there is no BOM and an explicit charset is specified in the Content-Type.
In any event, there are no cases (involving Content-Type) where including a UTF-8 XML encoding declaration on a UTF-8 document is any different from not having an encoding declaration at all.