What use is the 'encoding' in the XML header?

Looking at the XML header

<?xml version="1.0" encoding="UTF-16" standalone="no"?> 

Am I right to state that the encoding attribute is

  • coming too late (you can't read it properly unless you know the encoding...)
  • redundant, hence error-prone: it's all too easy to replace it with "Big5" yet save the file in UTF-8

Or is that attribute not about the content of the stream?

Am I mixing up things here?

xtofl asked Mar 02 '11 09:03



1 Answer

As you mentioned, you'd have to know the encoding of the file to read the encoding attribute.

However, there is a heuristic that easily gets you close enough to the "real" encoding to read the encoding attribute. This works because the <?xml declaration by definition can only contain characters in the ASCII range (however they are encoded), so its first few bytes reveal the encoding family (for example UTF-16, UTF-32, or something ASCII-compatible).

The XML standard even describes this detection process step by step (Appendix F, "Autodetection of Character Encodings", which is non-normative).

And the encoding label isn't redundant either. For example, if the algorithm in the XML spec tells you that some ASCII-compatible encoding is used, you still need to read the encoding label to find out which one is actually used. Valid candidates would be ASCII, UTF-8, any of the ISO-8859-* encodings, any of the Windows-* encodings, KOI8-R and many, many others. For the <?xml part itself it makes no difference which one it is, but for the rest of the document it can make a huge difference.
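A small Python illustration of that point: the same bytes, decoded with a few ASCII-compatible codecs, give an identical declaration but different content. The document bytes and the helper name `interpretations` are invented for this example.

```python
PROLOG = b'<?xml version="1.0"?><w>'
BODY = b"\xe9"  # a single non-ASCII byte
EPILOG = b"</w>"


def interpretations(raw, codecs):
    """Decode the same bytes with several codecs; None means 'not decodable'."""
    out = {}
    for codec in codecs:
        try:
            out[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            out[codec] = None
    return out


doc = PROLOG + BODY + EPILOG
results = interpretations(doc, ["iso-8859-1", "cp1251", "koi8-r", "utf-8"])
for codec, text in results.items():
    print(codec, repr(text))
```

Here the byte 0xE9 comes out as a Latin small e with acute in ISO-8859-1, as two different Cyrillic letters in Windows-1251 and KOI8-R, and is not valid UTF-8 at all, while the prolog reads the same in every case that decodes.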

Regarding mislabeled XML files: yes, it's easy to produce those. However, the XML spec clearly states that such files are not well-formed and therefore are not correct XML. An incorrect encoding must be reported as an error (as long as it can be detected!). So it's the problem of whoever is producing the XML.
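As a rough demonstration, here is how Python's standard-library parser (`xml.etree.ElementTree`) reacts to mislabeled input. The sample documents and the helper `parses` are made up for this sketch, and other parsers may report the problem differently. The last case shows the "as long as it can be detected" caveat: UTF-8 bytes labeled as ISO-8859-1 are still a legal ISO-8859-1 byte sequence, so the parser accepts them and silently returns the wrong text.

```python
import xml.etree.ElementTree as ET


def parses(data: bytes) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


text = '<?xml version="1.0" encoding="UTF-8"?><w>\u00e9</w>'
good = text.encode("utf-8")        # label and bytes agree
bad = text.encode("iso-8859-1")    # label says UTF-8, bytes are Latin-1

# Label says ISO-8859-1, bytes are UTF-8: cannot be detected by the parser.
undetected = '<?xml version="1.0" encoding="ISO-8859-1"?><w>\u00e9</w>'.encode("utf-8")

print(parses(good))
print(parses(bad))
print(repr(ET.fromstring(undetected).text))
```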

Joachim Sauer answered Oct 11 '22 15:10